About This Manual


Objectives 

This manual is intended to help new users of C/Paris get started quickly and be able to write 
simple but complete programs for the CM system. 


Intended Audience 

Readers are assumed to have a working knowledge of C programming and a very general
understanding of the components of the CM system. Prior knowledge of assembly-level
programming is helpful but not required.


Revision Information 

This is a new manual. It is a companion volume to the Paris Reference Manual, Version 5.0, and
its updates.


Organization 

Part I Getting Started 

These two chapters introduce C/Paris as an assembly-level subroutine library 
and provide a step-by-step explanation of a simple program. 

Part II Basic Concepts and Techniques 

These three chapters introduce the basics of C/Paris and data parallel
programming. They describe simultaneous computations within many processors
at once, Paris-level memory management, the data parallel techniques of
program control, and the means of configuring sets of virtual processors to
express large data sets.

Part III Interprocessor Communications 

These two chapters introduce communications among CM virtual processors. 
Once a program has configured virtual processors to express both the size and 
the shape of various data sets, the processors can send intermediate results to 
each other in a variety of patterns, and the program can perform cumulative 
operations along any of the dimensions of a data set. 


Part IV Commands and Utilities 

These two chapters outline the procedures for compiling and executing a
C/Paris program and introduce some utilities for debugging, run-time safety
checking, and monitoring program execution time.


Part V Examples 

The appendixes present six complete example programs. 


Associated Documents 

• CM User’s Guide: UNIX System Front End, Connection Machine documentation set 

• Paris Reference Manual, Version 5.0, Connection Machine documentation set 


Associated On-Line Directory 

• /cm/src/cparis-examples 

All the example programs shown in this manual appear in the above directory under the file
name given in the example’s caption. A Makefile is included. (Check with the site system
manager if the example directory has been placed in another location.)


Notation Conventions 


Convention      Meaning

boldface        UNIX and CM System Software commands, command options,
                and file names.

boldface        C/Paris and C language elements, such as keywords,
                operators, and function names, when they appear embedded
                in text.

italics         Parameter names and placeholders in function and command
                formats.

typewriter      Code examples and code fragments.

% boldface      In interactive examples, user input is shown in boldface
typewriter      and system output is shown in typewriter font.



Part I 

Getting Started 




Chapter 1 

What Is Paris? 


The data parallel programming model assumes an array of small processors, each with
some associated memory, all acting under the direction of a distinguished processor
called the front end. In the Connection Machine system, the front end is a standard
serial computer (a Sun-4 or certain models of VAX) with a bus interface to the
processor array within the CM itself.

The essence of the programming model is that each processor stores the information
for one data point in its local memory, and then all processors perform the same
operation on all the data points at the same time. For instance, a text retrieval program
might store articles one-per-processor and then have each processor search its article
for a keyword. Similarly, a graphics program might store pixels one-per-processor and
then have each processor compute the color value for its pixel.



Figure 1. Interactions between front end and CM 
Programming in C/Paris


Refinements of the programming model allow the user to direct that only a selected set
of processors perform a given operation. In the text retrieval program, for instance, the
processors that find the keyword might be instructed to search further for another
keyword, while those that did not find the keyword remain idle. Another refinement
enables processors to pass messages to each other. For instance, color shading in a
graphic image requires that each processor obtain surface information from
surrounding pixels (processors) in order to calculate its result.

The front-end computer runs a serial program that explicitly “parallelizes” certain
operations by transferring data and instructions over the bus to the CM processors. The
instructions might include:

• The data parallel equivalent of serial looping operations, as in the text retrieval
example just mentioned. These kinds of instructions are similar to the arithmetic
and relational operations of a serial computer, except that they are executed
by many processors simultaneously.

• Directions to the CM processors to compute some value that determines
whether they are to participate in the next instruction.

• Directions to the CM processors to communicate with each other, as in the
color-shading example. These kinds of instructions cause each processor to
send (or get) some intermediate result from another processor before
proceeding to the next instruction.

• Directions to the CM processors to return their results (or some aggregate
result) to the front end.

This set of instructions by which the front end directs the actions of the processor
array is Paris, the Connection Machine PARallel Instruction Set.
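
The kinds of instructions listed above can be pictured with an ordinary serial simulation. The following is only a sketch in plain C, not Paris code: the function names (sim_scale_all, sim_get_from_left, sim_global_max) are hypothetical, each array element stands for one processor's data point, and the wraparound at processor 0 is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Elementwise operation: the loop that the CM performs in one step.
   Each iteration stands for the work of one simulated processor. */
void sim_scale_all(int *value, size_t n_procs, int factor)
{
    size_t p;
    assert(value != NULL);
    for (p = 0; p < n_procs; p++)
        value[p] *= factor;
}

/* Communication: every processor obtains its left neighbor's value.
   Wraparound at processor 0 is an assumption made for this sketch.
   dest and src must be distinct so that all reads see the old values. */
void sim_get_from_left(int *dest, const int *src, size_t n_procs)
{
    size_t p;
    assert(dest != NULL && src != NULL && dest != src);
    for (p = 0; p < n_procs; p++)
        dest[p] = src[(p + n_procs - 1) % n_procs];
}

/* Aggregate result returned to the front end: the largest value. */
int sim_global_max(const int *value, size_t n_procs)
{
    size_t p;
    int best;
    assert(value != NULL && n_procs > 0);
    best = value[0];
    for (p = 1; p < n_procs; p++)
        if (value[p] > best)
            best = value[p];
    return best;
}
```

On a real CM each of these loops corresponds to a single instruction broadcast by the front end; the serial loop exists here only because the sketch runs on one processor.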


NOTE on I/O 

Paris instructions transfer data into and out of CM memory 
only by way of the front end. Specialized CM systems exist for 
transferring data directly between CM memory and peripheral 
storage or graphic display devices. See the volumes Connection
Machine I/O Programming and Connection Machine Graphics
Programming.
1.1 What Is C/Paris?

Paris is a low-level protocol in which the user can write data parallel programs for the 
Connection Machine system. It exists as three interfaces—C/Paris, Lisp/Paris, and 
Fortran/Paris—which are essentially subroutine libraries in the languages. 


A Subroutine Library 

C/Paris should be seen as a library rather than as a full language: it cannot be used
independently of C (or some other language with a suitable interface). A C/Paris
program is written in C, with Paris calls as needed to express parallel operations.

It is important to recognize that a CM program is directing two different computers:
the serial front end and the parallel CM. The higher-level CM languages mask this
duality somewhat, but at the Paris level it is always explicit:

• C code directs front-end (serial) operations, including the manipulation of
serial data on the front end and all control flow in the program.

• Paris calls direct only the handling of data by CM processors and the transfer
of data between the front end and the CM.


An Assembler-Like Language 

Paris is intended primarily as a base upon which to build the higher-level CM
languages: C*, CM Fortran, and *Lisp. For instance, the C* compiler generates serial C
code with calls to C/Paris routines. This output is passed to the front end’s C compiler,
which proceeds in the normal way to produce an executable load module.

Users can, however, gain finer control over program behavior by calling Paris routines
directly, either from within C* or from within an ordinary C program. As with any
low-level language, the user is trading off the convenience of high-level language
abstractions for program efficiency.


1.2 A Program Template 

Every C/Paris program must contain two particular lines of code, as shown in this
program template:


Example 1. C/Paris program template: template.c 


    #include <cm/paris.h>

    main()
    {
        CM_init();

        /* [C code] */
        /* [Paris function calls occurring within C code] */
        /* [C code] */
    }


The #include directive must occur at the beginning of any C/Paris program. The
header file paris.h sets up the C/Paris programming environment by declaring the
function names and defining data types and certain CM global configuration variables
used by Paris.

The C/Paris routine CM_init, which takes no arguments, must appear in the program
before any other calls to Paris instructions. It has two effects:

• CM_init initializes the values of the global configuration variables, such as:

    • CM_physical_processors_limit, determined by the size of the CM that
      is executing the program

    • CM_maximum_integer_length, determined by implementation
      restrictions, if any, in the version of Paris being used

  Programs can use these variables (rather than constant values) to ensure
  portability across CM software versions and hardware configurations. Programs
  should never set the values of CM configuration variables.
• CM_init also warm boots the CM. Warm booting prepares the system for the
  upcoming program by clearing the queue for the instruction bus between the
  front end and the CM, clearing error status indicators, and initializing certain
  system memory areas in the CM. (The user memory in the CM processors is
  not affected.)

  Warm booting should not be confused with cold booting, a more thorough
  initialization that completely resets the state of the CM hardware and clears user
  memory. Warm booting is always done from within a C/Paris program; cold
  booting is done only from the UNIX shell, using procedures described in the
  volume CM Front-End Subsystems.

The template shown is in fact a working C/Paris program, the simplest program
possible. Its effects, however, are limited to changes in CM state that would not be visible
to the programmer.

The kinds of C/Paris calls that do produce useful results are the subject of the
remainder of this manual.




Chapter 2 

A Simple Program 


This chapter examines a very simple C/Paris program: one that simply adds two
integer constants within each processor. The purpose of the exercise is to illustrate the
basic features of parallel instructions and parallel data, ignoring for the moment the
details and refinements.

The program is the parallel analogue of this trivial C program: 


Example 2. A simple C program: add-scalar-constants.c


    #include <stdio.h>

    main()
    {
        int a, b, sum;

        a = 2;
        b = 3;

        sum = a + b;

        printf( "\nThe sum of a and b is %d.\n", sum );
    }


In this program, a, b, and sum are scalar (single) values that reside on the front end.
Now, consider the same operations in add-parallel-constants.c, where the three
variables indicate multiple values in CM processors’ memories.


After showing the C/Paris program, the remainder of this chapter walks through the 
program step by step, pointing out some basic features of C/Paris programming and 
the ways in which it differs from serial C programming. 


NOTE 

This program, and all complete programs shown in this manual,
can be found on line in /cm/src/cparis-examples (or another
location determined by the site system manager).


Example 3. A simple C/Paris program: add-parallel-constants.c 


    #include <cm/paris.h>
    #include <stdio.h>

    #define LEN 32

    main()
    {
        /* ======================================================= */
        /* 1. Declare variables and warm boot CM */

        CM_field_id_t field_a, field_b, field_sum;
        int single_sum, agg_sum;

        CM_init();

        /* ======================================================= */
        /* 2. Allocate CM memory fields */

        field_a   = CM_allocate_heap_field( LEN );
        field_b   = CM_allocate_heap_field( LEN );
        field_sum = CM_allocate_heap_field( LEN );

        /* ======================================================= */
        /* 3. Specify that all processors participate */

        CM_set_context();

        /* ======================================================= */
        /* 4. Put values into the two "source" fields */

        CM_s_move_constant_1L( field_a, 2, LEN );
        CM_s_move_constant_1L( field_b, 3, LEN );

        /* ======================================================= */
        /* 5. Add the two source fields in each processor and
              place the result in the "destination" field */

        CM_s_add_3_1L( field_sum, field_a, field_b, LEN );

        /* ======================================================= */
        /* 6. Read the value from the destination field in a single
              processor and print it. */

        single_sum =
            CM_s_read_from_processor_1L( 0, field_sum, LEN );
        printf( "\nThe sum in processor 0 is %d.\n", single_sum );

        /* ======================================================= */
        /* 7. Add the destination fields in all processors, return
              the aggregate value to the front end, and print it */

        agg_sum = CM_global_s_add_1L( field_sum, LEN );
        printf( "\nThe sum of all the sums is %d.\n", agg_sum );

        /* ======================================================= */
        /* 8. Deallocate CM memory fields */

        CM_deallocate_heap_field( field_a );
        CM_deallocate_heap_field( field_b );
        CM_deallocate_heap_field( field_sum );
    }


The most immediately obvious difference between this program and its serial C
analogue is the sheer length of the C/Paris program. Its length is, however, largely a result
of its being low-level rather than of being parallel. The higher-level CM languages, such
as C*, can express parallel operations nearly as economically as C expresses serial
operations. This program is an example of the trade-off between convenience and
fine-tuned control that we see in comparing any high-level and low-level languages.


2.1 Allocating and Referencing Parallel Data 

User data is always stored in fields in CM memory. A field is simply one or more
contiguous bits that start at the same bit location in every processor. A CM field is roughly
analogous to a C variable in that it is associated with an address in memory. However,
there are two significant differences:

• Unlike a C variable, a CM field must be allocated explicitly. The allocation 
instructions return a field-id, which contains the address of the allocated field 
along with other information. The program can (optionally) assign the field-id 
to a front-end variable of type CM_field_id_t. 

• Fields are not associated with types. The length of a field (in bits) is specified 
when the field is allocated. Later, data of any supported CM type—signed or 
unsigned integer or floating-point number—can be placed in any field. 

In the example program, Step 2 allocates fields and assigns them to front-end variables
that have been declared in Step 1.


CM_field_id_t field_a; /* Step 1 */ 

field_a = CM_allocate_heap_field( LEN ); /* Step 2 */ 

These two steps allocate memory to hold parallel data and provide a front-end variable
with which to reference that portion of memory. The “stripe” of memory allocated
consists of 32 contiguous bits that begin at the same bit location in every processor. All
Paris instructions that operate on user data, including those that move data into fields,
take one or more field-id’s as operands.
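
To make the idea of a "stripe" concrete, the sketch below models a field-id as a starting bit offset plus a length, with the same offset applying in every processor. It is a front-end-only illustration in plain C: the types and functions (sim_cm, sim_field_id, sim_init, sim_allocate_field) and the memory sizes are hypothetical, and the real field-id may hold other information that the sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <string.h>

#define SIM_PROCS    4     /* hypothetical number of processors */
#define SIM_MEM_BITS 256   /* hypothetical bits of memory per processor */

/* Simulated CM memory: one row of bits per processor. */
typedef struct {
    unsigned char bit[SIM_PROCS][SIM_MEM_BITS]; /* one byte per bit, for clarity */
    int next_free;                              /* next unallocated bit offset */
} sim_cm;

/* Simulated field-id: where the stripe starts and how long it is. */
typedef struct {
    int offset;  /* starting bit location, identical in every processor */
    int length;  /* field length in bits */
} sim_field_id;

/* Clear the simulated memory and reset the allocator. */
void sim_init(sim_cm *cm)
{
    assert(cm != NULL);
    memset(cm, 0, sizeof *cm);
}

/* Reserve `length` contiguous bits at the same offset in all processors. */
sim_field_id sim_allocate_field(sim_cm *cm, int length)
{
    sim_field_id id;
    assert(cm != NULL);
    assert(length > 0 && cm->next_free + length <= SIM_MEM_BITS);
    id.offset = cm->next_free;
    id.length = length;
    cm->next_free += length;
    return id;
}
```

Allocating three 32-bit fields in this model yields offsets 0, 32, and 64; every processor's row holds its own copy of each field at those offsets.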

Figure 2 shows the state of CM system memory at this point in the program. The
program calls the allocation instruction three times to allocate three fields, and it
performs three assignments to previously declared front-end variables.






Figure 2. Allocating a field and returning a field-id 


Note that the last step of the program explicitly deallocates the fields. 

CM_deallocate_heap_field( field_a ); /* Step 8 */ 

This step is not strictly necessary in the example program, since the fields would be 
deallocated at program termination in any case. However, as in C, it is generally good 
practice to free up memory by deallocating items that are no longer needed. 


2.2 Setting CM Context 

Most CM programs use all the processors for some instructions and a selected subset 
of them for other instructions. For instance, in the text retrieval example given in the 
previous chapter, all processors execute the first instruction to search for key-word-1, 
but only those that met success execute the next instruction to search for key-word-2. 

The set of processors that are to execute the next instruction is called the selected set or 
active set. At any given time, the program’s context may change to a different selected 
set. At the Paris level, context is always set and reset explicitly. Step 3 of the example 
program is a Paris call that selects all processors; this context is retained throughout 
the program. 


    CM_set_context();                             /* Step 3 */
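
The effect of a selected set can be imitated on the front end with one flag per processor. The following sketch is plain C rather than Paris; sim_machine and the sim_ functions are hypothetical names, and the keyword-search data is invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PROCS 6   /* hypothetical number of processors */

typedef struct {
    int context[SIM_PROCS]; /* 1 = processor is in the selected set */
    int hits[SIM_PROCS];    /* per-processor data: matches for a keyword */
    int score[SIM_PROCS];   /* per-processor result */
} sim_machine;

/* Select every processor, as Step 3 of the example program does. */
void sim_set_context(sim_machine *m)
{
    size_t p;
    assert(m != NULL);
    for (p = 0; p < SIM_PROCS; p++)
        m->context[p] = 1;
}

/* Narrow the selected set to processors that found the keyword. */
void sim_select_where_hit(sim_machine *m)
{
    size_t p;
    assert(m != NULL);
    for (p = 0; p < SIM_PROCS; p++)
        m->context[p] = m->context[p] && (m->hits[p] != 0);
}

/* An instruction takes effect only in the selected processors. */
void sim_add_constant(sim_machine *m, int k)
{
    size_t p;
    assert(m != NULL);
    for (p = 0; p < SIM_PROCS; p++)
        if (m->context[p])
            m->score[p] += k;
}
```

Processors outside the selected set keep their old values, which is the behavior the text describes for processors that "remain idle."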


2.3 Moving Data into Fields 

When fields are allocated, they contain no useful data. This program illustrates one of 
several ways to initialize fields, in this case with signed integer constants. The 
CM_move family of instructions copy data into a field, either from another field or 
from the front end. If the front-end data to be moved is a constant or zero, the front end 
“broadcasts” a copy of the quantity to the specified field in each processor. 

    CM_s_move_constant_1L( field_a, 2, LEN );     /* Step 4 */
    CM_s_move_constant_1L( field_b, 3, LEN );

Like all Paris instructions that operate on fields, this instruction takes a field-id as an 
operand to indicate the bit address at which to begin storing the constant (2 or 3). In 
addition, when calling any Paris instruction that operates on fields, the program must 
specify the number of bits (LEN = 32) on which to operate. In this program, the number 
corresponds to the allocated length of the field. (Chapter 3 shows cases where the 
number might be smaller than the field.) 



Notice that it is the instruction, not the field, that expects a certain type of data. The
element _s in the instruction name indicates that the instruction treats its data
operand as a signed integer. (Alternatively, the element could be _u or _f, for unsigned
integer or floating-point number.) However, there is nothing analogous to type checking in
C/Paris. If the constant provided were 2.0, the instruction would simply take it to be a
signed integer and blithely put the wrong value in the specified field.
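
The point that the instruction, not the field, decides how bits are read can be shown with two small helper functions. This is a sketch in plain C with hypothetical names (field_as_unsigned, field_as_signed); it assumes the straight binary and two's-complement representations and is limited to lengths of at most 31 bits to stay within a 32-bit long.

```c
#include <assert.h>

/* Read the low `len` bits as an unsigned (straight binary) integer. */
unsigned long field_as_unsigned(unsigned long bits, int len)
{
    assert(len >= 1 && len <= 31);
    return bits & ((1UL << len) - 1UL);
}

/* Read the same `len` bits as a two's-complement signed integer. */
long field_as_signed(unsigned long bits, int len)
{
    unsigned long u, sign;
    assert(len >= 2 && len <= 31);
    u = field_as_unsigned(bits, len);
    sign = 1UL << (len - 1);          /* weight of the top bit */
    if (u & sign)
        return (long)u - (long)sign - (long)sign;  /* subtract 2^len */
    return (long)u;
}
```

For an 8-bit field holding the pattern 11111111, the unsigned reading is 255 and the signed reading is -1. Nothing stored in the field records which reading was intended.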


2.4 Computation on Fields 

Given the development of the program so far, the instruction that actually performs 
the computation is straightforward: 

    CM_s_add_3_1L( field_sum, field_a, field_b, LEN );    /* Step 5 */

The destination field is always specified first in a Paris call, followed by one or more
source fields as appropriate to the instruction. In this case, the instruction adds the
values in field_a and field_b, treating them as signed integers, and places the result in
field_sum.

Note the element _3 in the instruction name. This element indicates that the
instruction takes three field operands: two source fields and a different destination field. An
alternative that Paris provides is CM_s_add_2_1L, which places the result back into
one of the source fields. For instance, the following call increments the value in field_a
by the value in field_b:

    CM_s_add_2_1L( field_a, field_b, LEN );

The last element of the instruction name, _1L, indicates the number of different
lengths of the operands. This example used the _1L version because the three fields are
all of the same length (32 bits) and the program uses the full length of the fields. An
alternative is the _3L version, used when the operands are of different lengths. In the
_3L version, all the lengths are specified and in the same order as the field operands.
For instance:

CM_s_add_3_3L( field_sum, field_a, field_b, 32, 16, 16 ); 

These variations on the theme of add illustrate the major reason for the large size of the
Paris instruction set. The CM_add family of instructions includes variants for _u, _s,
and _f numbers; within each of these, there are variants for adding constants or adding
variables; and within each of these, there are variants for _2 and _3 field operands and
for _1L and _3L different lengths. The number of variations is a reflection of the low
level of the Paris instructions: each of these subtly different system operations is
expressed by a different Paris instruction.
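
The difference between the _3 and _2 operand forms described above can be sketched with ordinary arrays, one element per processor. The functions sim_add_3 and sim_add_2 are hypothetical stand-ins written for this illustration; they ignore field lengths and context.

```c
#include <assert.h>
#include <stddef.h>

/* Three-operand form: dest = src1 + src2; both sources are unchanged. */
void sim_add_3(int *dest, const int *src1, const int *src2, size_t n_procs)
{
    size_t p;
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    for (p = 0; p < n_procs; p++)
        dest[p] = src1[p] + src2[p];
}

/* Two-operand form: dest = dest + src; the destination is also a source. */
void sim_add_2(int *dest, const int *src, size_t n_procs)
{
    size_t p;
    assert(dest != NULL && src != NULL);
    for (p = 0; p < n_procs; p++)
        dest[p] += src[p];
}
```

The three-operand form costs an extra field of memory per processor but preserves both inputs; the two-operand form overwrites one of them.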
2.5 Moving Data to the Front End

The results computed by the CM processors are of no use to the main program until 
some result is returned to the front end. 

The example program shows two of the several ways in which the front end retrieves 
results from the CM processors. In both these cases, the data retrieved is a scalar, not 
parallel, quantity. 


From a Single Processor 

The first method of moving data from CM memory to the front end involves reading a 
field’s contents from one specified processor (see Figure 4). The value returned is then 
assigned to a front-end variable. 

    single_sum =                                  /* Step 6 */
        CM_s_read_from_processor_1L( 0, field_sum, LEN );

The first argument is a unique processor-id, or send address. Send addresses are more 
commonly used when processors send messages to each other. However, the front end 
can use a send address to access a value from a single CM processor. 



Figure 4. Returning one processor’s result to the front end 
Send addresses, and the means of computing them, are introduced in Chapter 7. Paris
does not guarantee that send addresses are consecutive integers, but we can be
confident that there will always be a processor numbered 0. We can thus check one result of
a parallel computation by using CM_type_read_from_processor_1L.


From a Computation across Processors 

A second method of retrieving scalar data from the CM processors is to perform some 
combining operation across all the values in the specified field and return the single 
result to the front end (see Figure 5). Paris provides several such instructions, all with 
the element _global in their names. 



In the example shown, Step 7 adds the values across field_sum and returns their
aggregate sum to the front end. The result is then assigned to the front-end variable
agg_sum.

    agg_sum = CM_global_s_add_1L( field_sum, LEN );    /* Step 7 */

The value returned in this example is n × 5, where n is the number of processors
participating. For instance, imagine that the program is executing on a 16K CM system. The
16,384 instances of 5 are added and the total 81920 is returned. If the program executes
on an 8K CM system, the aggregate sum is 40960.
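
The arithmetic above can be checked with a serial simulation of the whole example program. The function sim_add_parallel_constants is a hypothetical name; it uses ordinary heap arrays in place of CM fields and a loop in place of the global add.

```c
#include <assert.h>
#include <stdlib.h>

/* Simulate add-parallel-constants.c on n_procs processors.
   Returns the aggregate sum, or -1 if memory cannot be allocated. */
long sim_add_parallel_constants(size_t n_procs)
{
    long *a, *b, *sum;
    long total = 0;
    size_t p;

    assert(n_procs > 0);
    a = malloc(n_procs * sizeof *a);
    b = malloc(n_procs * sizeof *b);
    sum = malloc(n_procs * sizeof *sum);
    if (a == NULL || b == NULL || sum == NULL) {
        free(a);
        free(b);
        free(sum);
        return -1;
    }

    for (p = 0; p < n_procs; p++) {   /* Step 4: broadcast constants */
        a[p] = 2;
        b[p] = 3;
    }
    for (p = 0; p < n_procs; p++)     /* Step 5: add within each processor */
        sum[p] = a[p] + b[p];
    for (p = 0; p < n_procs; p++)     /* Step 7: combine across processors */
        total += sum[p];

    free(a);
    free(b);
    free(sum);
    return total;
}
```

Calling the function with 16384 gives 81920, and with 8192 gives 40960, matching the figures in the text.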
2.6 Compiling and Executing the Program

A C/Paris program is compiled with the front end’s C compiler cc and executed with 
the CM System Software command cmattach. The detailed procedures for compiling 
and executing a C/Paris program are described in Chapter 8. 

Those who have faith can simply type the following command lines: 

    % cc add-parallel-constants.c -lparis -lm
    % cmattach -w a.out
    Attaching the Connection Machine system [name]...
    coldbooting... done.
    Attached to 8192 physical processors

    The sum in processor 0 is 5.

    The sum of all the sums is 40960.
    Detaching... done.
    %


NOTE 

The command lines just shown apply to Paris Version 5.0. In 
later versions, please consult the current CM documentation 
to determine whether these command options have changed. 
In particular, the means of specifying the Paris library on the 
cc command line may change. 





Part II 

Basic Concepts and Techniques 





Chapter 3 

Computing within Processors 


The vast majority of Paris instructions cause some computation to occur within each
selected processor, independently of the others. Most of these instructions are
arithmetic and relational operations that will be familiar to assembly-language
programmers and not surprising to C programmers.

The major differences between intraprocessor Paris instructions and the analogous C 
operations arise from the low level of Paris instructions, rather than from parallelism: 

• The data on which Paris operates, the CM data formats, are unlike C types. 

• Working with CM memory fields requires more explicit storage management 
than working with C variables. 

This chapter illustrates intraprocessor Paris instructions, although it does not attempt 
to give an exhaustive listing of them. Rather, the focus is on using CM data formats and 
managing CM memory fields in the course of executing intraprocessor instructions. 


NOTE 

The emphasis in this chapter is on differences from C
programming. Users who are experienced in assembly-language
programming can skim quickly over this chapter.
3.1 What Is a “Processor”?

This chapter adopts an artificially constrained programming model: the number of 
processors is assumed to be equal to the size of the physical CM on which a program 
executes, and the processors do not exchange data with each other. 

In fact, the “processors” referred to in this chapter are not physical CM processors.
Paris supports a virtual processing mechanism whereby each physical processor
simulates some number of virtual processors. Further, the virtual processors can be
logically configured into a variety of shapes of up to 31 dimensions, which facilitates the
exchange of data between logical “neighbors.” Chapter 5 introduces the virtual
processing mechanism, along with the methods for configuring the virtual machine.

The programs shown in the present chapter make use of the default virtual processor
configuration, which is a 2-dimensional array of processors the same size as the
physical machine. The procedures introduced here for intraprocessor computations and
storage management are the same in any configuration of virtual processors.


3.2 Arithmetic and Relational Instructions 


Paris provides a large selection of arithmetic and relational instructions, analogous to 
those found in the instruction sets of many serial computers. These include: 

• Binary arithmetic, such as add, subtract, multiply, divide, max, min, truncate,
round, rem, mod, power, and others.

• Unary arithmetic, such as negate, sqrt, abs, signum, integer-length, logcount,
floor, ceiling, truncate, as well as transcendental and trigonometric functions
sin, cos, tan, sinh, cosh, tanh, exp, and ln.

• Bitwise booleans on two operands, logand, logior, logeqv, lognand, lognor,
logandc1, logandc2, logorc1, logorc2, as well as lognot for one operand.


Instruction Variants and Names 

Each of the intraprocessor instructions takes as operands: 

• The field-id’s of one or more source fields and a destination field (which may 
be the same field as one of the source fields) 

• One or more length specifiers 
Also, most of the instructions have variants for signed and unsigned integers and for
floating-point numbers. A few variants are not provided either because they are
nonsensical (there is no unsigned abs, for example) or because they are not generally useful
(for instance, the trigonometric functions are provided for floating-point data only).

The user can recognize the action of a Paris instruction (and often predict the
existence of related instructions) by parsing its name:



At this point, it would be helpful for the new Paris user to scan the early sections of 
Chapter 5, “Instruction Set Overview,” in the Paris Reference Manual, Version 5.0. 
This chapter lists all Paris instructions by name under various functional categories 
and identifies the variants of each instruction. 
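
As an illustration of parsing an instruction name, the sketch below splits names of the simple form CM_type_operation_n_kL, such as CM_s_add_3_1L, into their elements. It is a front-end helper written for this discussion, not part of Paris: the struct paris_name and the function parse_paris_name are hypothetical, and names that follow other patterns (for example the _global instructions) are deliberately rejected.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    char type;          /* 's', 'u', or 'f' */
    char operation[32]; /* e.g. "add" or "multiply_constant" */
    int  operands;      /* the _2 or _3 element */
    int  lengths;       /* the _1L or _3L element */
} paris_name;

/* Returns 1 on success, 0 if the name does not fit the simple pattern
   CM_<type>_<operation>_<digit>_<digit>L handled by this sketch. */
int parse_paris_name(const char *name, paris_name *out)
{
    size_t n, op_len;

    assert(name != NULL && out != NULL);
    n = strlen(name);

    /* Shortest accepted name is like "CM_s_x_3_1L" (11 characters). */
    if (n < 11 || strncmp(name, "CM_", 3) != 0 || name[4] != '_')
        return 0;
    if (name[3] != 's' && name[3] != 'u' && name[3] != 'f')
        return 0;

    /* The tail must look like "_<digit>_<digit>L". */
    if (name[n - 1] != 'L' || name[n - 3] != '_' || name[n - 5] != '_')
        return 0;
    if (name[n - 2] < '1' || name[n - 2] > '9')
        return 0;
    if (name[n - 4] < '1' || name[n - 4] > '9')
        return 0;

    /* What remains between the 5-character prefix and suffix. */
    op_len = n - 10;
    if (op_len >= sizeof out->operation)
        return 0;

    out->type = name[3];
    memcpy(out->operation, name + 5, op_len);
    out->operation[op_len] = '\0';
    out->operands = name[n - 4] - '0';
    out->lengths = name[n - 2] - '0';
    return 1;
}
```

For CM_s_add_3_1L the sketch reports type s, operation add, three field operands, and one length.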


Example of Intraprocessor Computations 

The following example, unsigned-arithmetic.c, illustrates some of the intraprocessor
instructions in action. The instruction variants shown all place their results in a
separate destination field, rather than back into a source field. In a real program this would
be a profligate use of memory, but it does facilitate showing the results of all the
operations in a single snapshot (Figure 7).

This example also shows some additional methods for getting data into and out of CM
memory, beyond those shown in earlier examples:

• For input, the example uses a move instruction (Step 1), an implicit move
instruction (Step 4), and the unsigned variant of random, which places random
integers up to a specified limit into the destination field (Step 1).

• For output, the example uses two _global reduction instructions (Steps 7 and 
8). These instructions are available as add, multiply, max, and min, each for 
_u, _s, and _f data, as well as logand, logior, and logxor. 


Example 4. Computing within processors: unsigned-arithmetic.c 


    #include <cm/paris.h>
    #include <stdio.h>

    #define LEN 8

    main()
    {
        CM_field_id_t a, b, c, d, e, f, g;
        unsigned int max_value, min_value;
        char *string;

        CM_init();

        a = CM_allocate_heap_field( LEN );
        b = CM_allocate_heap_field( LEN );
        c = CM_allocate_heap_field( LEN );
        d = CM_allocate_heap_field( LEN );
        e = CM_allocate_heap_field( LEN );
        f = CM_allocate_heap_field( LEN );
        g = CM_allocate_heap_field( LEN );

        CM_set_context();

        /* ======================================================= */
        /* 1. Initialize fields a and b */

        CM_u_random_1L( a, 100, LEN );
        CM_u_move_constant_1L( b, 2, LEN );

        /* ======================================================= */
        /* 2. Compute the max of a and b */

        CM_u_max_3_1L( c, a, b, LEN );

        /* ======================================================= */
        /* 3. Multiply a by b */

        CM_u_multiply_3_1L( d, a, b, LEN );

        /* ======================================================= */
        /* 4. Multiply a by the constant 2 */

        CM_u_multiply_constant_3_1L( e, a, 2, LEN );

        /* ======================================================= */
        /* 5. Divide a by b and round the result toward zero */

        CM_u_truncate_3_1L( f, a, b, LEN );

        /* ======================================================= */
        /* 6. Take the integer square root of f */

        CM_u_isqrt_2_1L( g, f, LEN );

        /* ======================================================= */
        /* 7. Find the maximum value in g, return it to the
              front end, and print it */

        max_value = CM_global_u_max_1L( g, LEN );
        printf( "The largest value in field g is %d.\n", max_value );

        /* ======================================================= */
        /* 8. Determine whether any of the values in g is zero
              and print that information */

        min_value = CM_global_u_min_1L( g, LEN );
        string = min_value ? " not" : "";
        printf( "Field g does%s contain a zero.\n", string );
    }


Note in Step 8 that a CM integer of length 8 is assigned to a front-end unsigned int 
(which is of length 32 on VAX and Sun). The two numbers need not be of the same 
length. The only constraint is that the front-end type should be large enough to hold the 
value retrieved from the CM. In this example, the returned value would fit comfortably 
into a char. 
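
The size constraint can be stated as a simple check: an unsigned CM field of a given length holds values up to 2 to the power of that length, minus 1, and the front-end type must be able to represent that maximum. The helper names below (max_unsigned_for_length, fits_in_unsigned_char, fits_in_unsigned_int) are hypothetical, and the sketch is limited to lengths of at most 31 bits.

```c
#include <assert.h>
#include <limits.h>

/* Largest value an unsigned field of `len` bits can hold. */
unsigned long max_unsigned_for_length(int len)
{
    assert(len >= 1 && len <= 31);
    return (1UL << len) - 1UL;
}

/* Nonzero if every value of a len-bit field fits in an unsigned char. */
int fits_in_unsigned_char(int len)
{
    return max_unsigned_for_length(len) <= (unsigned long)UCHAR_MAX;
}

/* Nonzero if every value of a len-bit field fits in an unsigned int. */
int fits_in_unsigned_int(int len)
{
    return max_unsigned_for_length(len) <= (unsigned long)UINT_MAX;
}
```

With LEN defined as 8, the largest possible value is 255, which fits in an unsigned char on a front end with 8-bit characters; a 9-bit field would not.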

The changes that occur in CM memory from this set of computations are shown in 
Figure 7, which depicts an arbitrary set of five processors. 
Figure 7. Changes in CM state from unsigned-arithmetic.c

3.3 Data Formats

User data on the CM is always stored in bit fields, that is, in sets of contiguous bits that 
begin at the same memory location in each processor and extend for some arbitrary 
length. However, many Paris instructions interpret bit fields as being of certain data 
“types” or storage formats. The currently supported formats are: 

• Unsigned integer, represented in straight binary form 

• Signed integer, represented in two’s-complement form 

• Floating-point number, represented in three subfields (for significand,
exponent, and sign) in a format like the IEEE standard

The format of the data in any given field is determined by the instruction that moves 
data into that field, rather than by any feature of the field itself. 
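The point that a field has no inherent format can be illustrated on the front end in ordinary C. The following is only a sketch: the helper names as_unsigned and as_signed are hypothetical and are not Paris routines. It reads the same low-order bits once as straight binary and once as two's complement, which is in spirit what the _u and _s instruction variants do with one and the same field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read the low `len` bits of `bits` as an
   unsigned integer (straight binary). */
uint64_t as_unsigned(uint64_t bits, unsigned len)
{
    assert(len >= 1 && len <= 32);
    return bits & ((UINT64_C(1) << len) - 1);
}

/* Hypothetical helper: read the same `len` bits as a
   two's-complement signed integer. */
int64_t as_signed(uint64_t bits, unsigned len)
{
    assert(len >= 2 && len <= 32);
    uint64_t value = as_unsigned(bits, len);
    uint64_t sign_bit = UINT64_C(1) << (len - 1);

    if (value & sign_bit)                       /* top bit set: negative */
        return (int64_t)value - (int64_t)(UINT64_C(1) << len);
    return (int64_t)value;
}
```

For instance, the 8-bit pattern 11110000 yields 240 through as_unsigned and -16 through as_signed; the bits themselves are identical.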


Operand Lengths 

Each CM processor is a one-bit serial processor. Since the basic granule of operation 
is a bit—rather than a byte or a word, as in more complex processors—Paris does not 
enforce any length or alignment requirements on data formats. In this respect, CM 
data formats are completely unlike C types. 

Any of the CM “types” or formats can be of almost any length: 

• Signed integers can be any length from 2 bits up to the value of the CM
configuration variable CM_maximum_integer_length, which is version-dependent but
never less than 128 bits.

• Unsigned integers can be any length up to the value of
CM_maximum_integer_length.

• The subfields for floating-point numbers can be:

    • Significand: from 1 bit up to the value of the configuration variable
    CM_maximum_significand_length, which is never less than 96 bits.

    • Exponent: from 2 bits up to the value of the configuration variable
    CM_maximum_exponent_length, which is never less than 32 bits.

    • Sign: always 1 bit.

To correspond with IEEE single- and double-precision floating-point formats,
the allocated field should be 32 bits or 64 bits, and the length arguments in
procedure calls should be specified as 23, 8 or 52, 11, respectively. (The last bit
is the sign bit.) Thus:

field_name = CM_allocate_heap_field( 32 );

CM_f_random_1L( field_name, 23, 8 );

or,

field_name = CM_allocate_stack_field( 64 );

CM_f_random_1L( field_name, 52, 11 );

The choice of length for a field depends primarily on the size of the numbers it is to 
contain and the degree of precision desired. Specifically, dynamic range increases with 
field size for integers and floating-point exponents; precision increases with field size 
for floating-point significands. Other considerations in the choice of field length are: 

• Memory usage. Like a C type, a CM field need be no longer than the number of
bits required to represent its largest value. For instance, the program shown in
Example 4, unsigned-arithmetic.c, uses 8-bit fields because the largest value
to be represented is 198. (The exclusive upper bound of the random number
initialization is 100, and the only increase in value is multiplication by 2.)

• Overflow. Shorter lengths run the risk of overflow, where the value moved into
a field is too large to be represented in that field. Most Paris instructions that
operate on fields set a state bit called the overflow flag when overflow has
occurred. (The means of checking for overflow is described in Chapter 4.)

For instance, the following call results in an overflow condition, producing an
undefined result and setting the overflow flag as a side effect in each active
processor:

CM_s_move_constant_1L( dest_field, 1000000, 8 );

• Execution time. The extra bits needed to guarantee against overflow are not
“free.” Since the CM processors act upon one bit at a time in a serial fashion,
processing time increases at least linearly with field length.

• Use of floating-point hardware. For most data formats, length does not affect
portability across CM system components, such as the various supported
front ends (barring overflow, of course). However, on systems equipped with
an optional 32-bit floating-point accelerator, floating-point operands must be
of length 23, 8 to be executed on hardware.
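The trade-off between memory usage and overflow can be checked on the front end before a field length is chosen. The sketch below is plain C, not Paris; min_unsigned_length and fits_signed are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: smallest number of bits that can hold `value`
   as an unsigned integer. */
unsigned min_unsigned_length(unsigned long value)
{
    unsigned len = 1;                 /* even zero occupies one bit */

    while (value >>= 1)
        len++;
    return len;
}

/* Hypothetical helper: nonzero if `value` fits in a two's-complement
   field of `len` bits. */
int fits_signed(long value, unsigned len)
{
    assert(len >= 2 && len <= 31);
    long limit = 1L << (len - 1);     /* 2^(len-1) */

    return value >= -limit && value < limit;
}
```

With these helpers, min_unsigned_length( 198 ) is 8, matching the 8-bit fields used in Example 4, and fits_signed( 1000000, 8 ) is zero, matching the overflowing call shown above.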


Format Conversions 

A very important difference between C and C/Paris is that there is no type checking 
and no coercion in C/Paris. Paris instructions simply take the values that are passed to 
them and treat them as the format they are designed to operate upon. It is entirely the 
responsibility of the program to ensure the compatibility of operands. 

Because Paris has no strong typing, programs are vulnerable to two kinds of errors 
that do not ordinarily occur in C: 

• No type checking. If the operand format is not what an instruction expects, the
instruction simply produces incorrect results. For instance:

CM_s_move_constant_1L( field_a, 5, 32 );

CM_f_negate_1_1L( field_a, 23, 8 ); /* wrong results */

After broadcasting an integer constant into field_a, this code negates what it
takes to be a floating-point number, obediently breaking it into subfields and
so on. The result is neither the integer value -5 nor the floating-point value
-5.0.

• No coercion. If operands are incompatible with each other, the instruction
simply produces incorrect results. For instance:

CM_s_move_constant_1L( field_f, 5, 32 );
CM_f_move_constant_1L( field_g, 5.0, 23, 8 );

CM_s_add_2_1L( field_f, field_g, 32 ); /* wrong results */
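The same pitfall can be reproduced on the front end in ordinary C, where the compiler normally hides it by coercing operands. The following sketch assumes that the front end stores float in the 32-bit IEEE single-precision format; the function names are hypothetical. It contrasts reinterpreting the bits of an integer as a floating-point number with converting the integer's value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: treat the 32 bits of an integer as if they were
   a float, without converting them. This is the front-end analogue of
   handing an integer field to a floating-point instruction. */
float reinterpret_as_float(int32_t bits)
{
    float result;

    assert(sizeof result == sizeof bits);
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Hypothetical helper: convert the integer's value to floating point,
   which is what an explicit conversion step is for. */
float convert_to_float(int32_t value)
{
    return (float)value;
}
```

Under the stated assumption, convert_to_float( 5 ) yields 5.0, while reinterpret_as_float( 5 ) yields a tiny positive number that has nothing to do with 5.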

Paris programs must convert operands explicitly to the desired format, using the 
unary conversion instructions provided. These instructions have two “type” elements 
in their names, signifying the result format and the operand format, respectively. 

CM_f_s_float_2_2L dest source source-len dest-sig-len dest-exp-len
CM_f_u_float_2_2L dest source source-len dest-sig-len dest-exp-len
CM_s_f_floor_2_2L dest source dest-len source-sig-len source-exp-len
CM_s_f_truncate_2_2L dest source dest-len source-sig-len source-exp-len

Correct versions of the above code fragments might be the following (assuming that all
the needed fields have been previously allocated and assigned):

CM_s_move_constant_1L( field_a, 5, 32 );

CM_f_s_float_2_2L( field_b, field_a, 32, 23, 8 );

CM_f_negate_1_1L( field_b, 23, 8 ); /* correct */

and,

CM_s_move_constant_1L( field_f, 5, 32 );
CM_f_move_constant_1L( field_g, 5.0, 23, 8 );

CM_s_f_floor_2_2L( field_h, field_g, 32, 23, 8 );

CM_s_add_2_1L( field_f, field_h, 32 ); /* correct */


3.4 More on Fields

Paris programs can use two kinds of data storage, heap fields and stack fields. The
distinction between the two is analogous to the distinction between static and automatic
variables in C.

In addition, programs can divide a field into subfields by means of an offset
instruction, and then treat the subdivided field somewhat like a structured data type.


Using Heap Fields 

Heap fields are the “general-purpose” form of CM storage. These fields are intended 
to be used like C global variables: that is, they have indefinite extent, and they can be 
used within any lexical scope. Heap fields must be explicitly allocated, and they can be 
deallocated at any time and in any order. 

The commonly used instructions that pertain to heap fields are: 

CM_allocate_heap_field length 
CM_deallocate_heap_field field-id
CM_is_field_in_heap field-id

The first instruction takes a length in bits and returns a field-id to the front end. The
other two take a field-id argument and perform the action indicated. Notice that the
space allocator makes no reference to the format of the data the field will contain. Any
field can be used for any CM data format.


Using Stack Fields

Stack fields, like C local variables, are intended for use within some bounded lexical 
scope such as the body of a procedure. However, stack fields do not simply disappear 
when the procedure is exited. Stack fields must be explicitly deallocated, in the
reverse of the order in which they were allocated. Deallocating a stack field causes all
stack fields that were allocated later to be deallocated as well.

The commonly used instructions that pertain to stack fields are: 

CM_allocate_stack_field length 
CM_deallocate_stack_through field-id 
CM_is_field_in_stack field-id 

CM_is_stack_field_newer stack_query_field stack_base_field

An example of stack fields in use is the following:


Example 5. Using a stack field: swap-signed-integers.c


#include <cm/paris.h>

swap_signed_integers( a, b, len )
CM_field_id_t a, b;
unsigned int len;
{
    CM_field_id_t temp;

    temp = CM_allocate_stack_field( len );

    CM_s_move_1L( temp, a, len );
    CM_s_move_1L( a, b, len );
    CM_s_move_1L( b, temp, len );

    CM_deallocate_stack_through( temp );
}


Notice that the stack field is deallocated before the procedure exits. This is the way
that C compilers make local variables “automatically” disappear. Notice also that the
same front-end type, CM_field_id_t, is used to store the ID’s of both stack fields and
heap fields.


Storage Management 

The difference in the recommended use between heap fields and stack fields does not 
arise from differences in extent: both heap fields and stack fields persist until they are 
deallocated (or until the program terminates). The difference in recommended use 
arises from the way the system manages the two kinds of memory: 

• Heap fields, if explicitly deallocated, can be deallocated in any order without 
side effects on other heap fields. 

• Stack fields should be explicitly deallocated, and they must be deallocated in 
the reverse order in which they were allocated. If a stack field is not deallocated 
explicitly, it may be deallocated as a side effect of deallocating an earlier stack 
field. 

When fields are allocated, storage is reserved for heap fields and stack fields at
opposite ends of CM memory (see Figure 8). The stack is managed with the standard Last
In First Out (LIFO) stack protocol: new stack fields are always allocated, and always
deallocated, at the top of the stack. As a result, the stack remains packed.
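The LIFO discipline can be pictured with a small front-end simulation. This is only a sketch in plain C: the sim_ functions are hypothetical stand-ins that mimic the behavior described for CM_allocate_stack_field and CM_deallocate_stack_through, and they say nothing about how Paris actually implements its stack.

```c
#include <assert.h>

#define MAX_FIELDS 16   /* arbitrary capacity of the simulated stack */

static unsigned stack_lengths[MAX_FIELDS]; /* bit length of each live field */
static int stack_count = 0;                /* number of live stack fields */

/* Push a new field; the returned index plays the role of a field-id. */
int sim_allocate_stack_field(unsigned len)
{
    assert(stack_count < MAX_FIELDS);
    stack_lengths[stack_count] = len;
    return stack_count++;
}

/* Pop the named field and every field allocated after it. */
void sim_deallocate_stack_through(int id)
{
    assert(id >= 0 && id < stack_count);
    stack_count = id;
}

/* Total number of bits currently reserved on the simulated stack. */
unsigned sim_stack_bits_in_use(void)
{
    unsigned total = 0;

    for (int i = 0; i < stack_count; i++)
        total += stack_lengths[i];
    return total;
}
```

After fields of 32, 8, and 1 bits are allocated in that order, deallocating through the 8-bit field also releases the 1-bit field, leaving 32 bits in use. Because fields only ever leave from the top, no gaps can form.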

Heap fields, in contrast, are allocated in the first available space. Early in the program,
the space is probably the top of the heap. However, since heap fields can be
deallocated in any order, gaps tend to form over time and later fields may be placed in
these gaps. See Figure 8 for the result: the stack is packed and the fields appear in the
order allocated, but the heap is fragmented and the fields are not in the order allocated.

The consequences of these features of storage management for system efficiency are: 

• Stack storage is always space-efficient, whereas the space efficiency of heap
storage tends to deteriorate over time.

• Allocating a stack field is somewhat faster than allocating a heap field,
depending on the degree of fragmentation of the heap. However, the difference is
not as great as between static and automatic variables on a serial machine.

Figure 8. State of CM memory at a midpoint in program execution


3.5 Bit-Field Operands 

Paris distinguishes between a data operand and the field that contains it. A field is
simply some number of bits that have been allocated and which can therefore be
accessed. A data operand, however, is the set of bits specified in a call to a Paris routine.
The bit-field operand should fall entirely within an allocated field, but it need not be
coterminous with the field.


Field-id’s and Lengths 

A bit-field operand is specified by: 

• A “pointer” to the bit address at which to begin operating in each processor. 
The field-id indicates the first bit of an allocated field. 

• The number of bits on which to operate. This length specifier can be any value
between the minimum length of the data format and the length of the field.


NOTE

Although field-id’s are associated with CM memory addresses, 
they are not C pointers and programs should never attempt to 
use them for indirect addressing. To be precise, in the present 
implementation of Paris, field-id’s are indices into a front-end 
array of structures, where each structure describes a CM field. 


A bit-field operand need not begin at the beginning of an allocated field. The following 
instruction takes a field-id and an unsigned integer offset and returns a new field-id. 

CM_add_offset_to_field_id field-id offset


As example layouts, consider a 32-bit field that may contain either an integer or a
floating-point number. The address indicated by the field’s ID is the first bit of the field
(bit 0), as shown in Figure 9. This is the least significant bit of the integer or of the
significand. The number’s sign, if any, is stored in the last bit (bit 31). If the length
specifiers of the floating-point number are 23 and 8, the exponent begins at bit 23.

Figure 9. Field-id and data layout in a 32-bit field

By adjusting the starting point and the length, a program can extract the constituent
subfields of this 32-bit floating-point number. The following fragment copies the
significand into field b, the exponent into field c, and the sign into field d.

a = CM_allocate_heap_field( 32 );
b = CM_allocate_stack_field( 32 );
c = CM_allocate_stack_field( 8 );
d = CM_allocate_stack_field( 1 );

CM_f_move_constant_1L( a, 5.86, 23, 8 );

CM_u_move_1L( b, a, 23 );

CM_u_move_1L( c, CM_add_offset_to_field_id( a, 23 ), 8 );
CM_u_move_1L( d, CM_add_offset_to_field_id( a, 31 ), 1 );
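The same offset-and-length arithmetic can be tried on the front end. The sketch below is plain C rather than Paris, and it assumes that the front end stores float in the 32-bit IEEE single-precision format, whose bit layout then matches the 23, 8 layout described above. The helper names extract_bits and float_bits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `len` bits starting at bit `offset`
   (bit 0 is the least significant bit). */
uint32_t extract_bits(uint32_t word, unsigned offset, unsigned len)
{
    assert(len >= 1 && len <= 31 && offset + len <= 32);
    return (word >> offset) & ((UINT32_C(1) << len) - 1);
}

/* Hypothetical helper: return the raw bit pattern of a float. */
uint32_t float_bits(float value)
{
    uint32_t bits;

    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

For the value 5.86, extract_bits( bits, 23, 8 ) returns the biased exponent 129 and extract_bits( bits, 31, 1 ) returns 0 for the sign; the offsets 0, 23, and 31 are the same ones passed to CM_add_offset_to_field_id in the fragment above.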

Similarly, by adjusting starting points and lengths, programs can use CM_move
instructions to perform bit shifts, as shown in Example 6. These shift routines move
specified bits from a source field into a position in the destination field that is offset to
the right or left, and then move zeros in to fill the offsets.

For instance, Figure 10 shows the change that results in each processor from a left
(“upward”) shift by one bit position. The most significant bit is lost, and a separate
operation moves a zero into the least significant bit position. Of course, the same
operation could be performed in place by specifying the source field as the destination
field.



Figure 10. A left shift by one bit position 


Routines that perform right and left shifts by any number of bit positions might be
implemented as follows:

Example 6. Performing unsigned shifts: unsigned-shift.c


#include <cm/paris.h>

shift_left( dest, source, shift_amt, len )
CM_field_id_t dest, source;
int shift_amt, len;
{
    CM_u_move_1L( CM_add_offset_to_field_id( dest, shift_amt ),
                  source,
                  len - shift_amt );

    CM_u_move_zero_1L( dest, shift_amt );
}


shift_right( dest, source, shift_amt, len )
CM_field_id_t dest, source;
int shift_amt, len;
{
    CM_u_move_1L( dest,
                  CM_add_offset_to_field_id( source, shift_amt ),
                  len - shift_amt );

    CM_u_move_zero_1L( CM_add_offset_to_field_id( dest, len - shift_amt ),
                       shift_amt );
}
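The logic of these shift routines can be checked without a CM by simulating one processor's field as an array with one bit per element. This is a sketch in plain C under that assumption; all of the sim_ names and the to_bits and from_bits helpers are hypothetical, and the simulation uses separate source and destination buffers.

```c
#include <assert.h>

/* One processor's field: one bit per array element, index 0 = LSB. */

/* Copy `len` bits from src to dest (stand-in for an unsigned move). */
void sim_move_bits(unsigned char *dest, const unsigned char *src, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = src[i];
}

/* Clear `len` bits starting at dest (stand-in for a move of zeros). */
void sim_move_zero(unsigned char *dest, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = 0;
}

/* Left shift: dest + shift_amt plays the role of the offset field-id. */
void sim_shift_left(unsigned char *dest, const unsigned char *src,
                    int shift_amt, int len)
{
    assert(shift_amt >= 0 && shift_amt <= len);
    sim_move_bits(dest + shift_amt, src, len - shift_amt);
    sim_move_zero(dest, shift_amt);
}

/* Right shift: the source is offset and the vacated top bits are zeroed. */
void sim_shift_right(unsigned char *dest, const unsigned char *src,
                     int shift_amt, int len)
{
    assert(shift_amt >= 0 && shift_amt <= len);
    sim_move_bits(dest, src + shift_amt, len - shift_amt);
    sim_move_zero(dest + (len - shift_amt), shift_amt);
}

/* Spread the low `len` bits of `value` into a bit array. */
void to_bits(unsigned value, unsigned char *bits, int len)
{
    for (int i = 0; i < len; i++)
        bits[i] = (unsigned char)((value >> i) & 1u);
}

/* Collect a bit array back into an integer. */
unsigned from_bits(const unsigned char *bits, int len)
{
    unsigned value = 0;

    for (int i = 0; i < len; i++)
        value |= (unsigned)bits[i] << i;
    return value;
}
```

For an 8-bit value, the result of sim_shift_left agrees with the C expression (x << n) & 0xFF, and the result of sim_shift_right agrees with x >> n.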


Using Subfields 

Although Paris does not provide structured data types, programs can treat a
subdivided field somewhat like a per-processor array or structure. Instructions can operate
either on the entire field or on some or all of its subfields.

The advantages of using subfields are: 

• Reduced allocation time. Be aware that allocating a CM field, even a stack field,
requires much more overhead than allocating a C variable.

• More efficient data movement. The time required by the operations that move 
data between processors (including to and from the front end) is reduced by 
“packetizing” data items in a single field. 

The major disadvantage is: 

• Disables safety checking. The run-time safety utility (Chapter 10) checks, 
among other things, that data operands do not exceed the allocated length of 
the field that contains them. However, the safety utility has no information on 
the boundaries of subfields. 


A program that treats data both as a whole field and as subfields is shown in
Example 7. A complex number is stored in a 64-bit field as two 32-bit subfields, one for
the real part and one for the imaginary part:

Figure 11. Field-id’s in a subdivided field that represents a complex number


The following program operates separately on the subfields to multiply the complex
number by 10.0. It then operates on the whole field to copy the complex number to
another field.

Example 7. Using subfields: simulate-complex-number.c


#include <cm/paris.h>
#include <stdio.h>

#define LEN 32

main()
{
    CM_field_id_t complex_field_1, real_part, imag_part,
                  complex_field_2;

    CM_init();

    complex_field_1 = CM_allocate_stack_field( 2*LEN );
    real_part = complex_field_1;
    imag_part = CM_add_offset_to_field_id( complex_field_1, LEN );
    CM_set_context();

    /* Initialize the subfields with 32-bit floating-point numbers */

    CM_f_random_1L( real_part, 23, 8 );
    CM_f_random_1L( imag_part, 23, 8 );

    /* Multiply the complex number by the real value 10.0 */

    CM_f_multiply_constant_2_1L( real_part, 10.0, 23, 8 );
    CM_f_multiply_constant_2_1L( imag_part, 10.0, 23, 8 );

    /* Copy the complex number to another field */

    complex_field_2 = CM_allocate_stack_field( 2*LEN );
    CM_u_move_1L( complex_field_2, complex_field_1, 2*LEN );

    /* Other code */

    /* Signal program completion from the front end */
    printf( "\nProgram execution completed.\n" );
}

Notice that the last Paris call in this program uses the _u variant of CM_move to copy
the field’s contents. This variant treats the contents as a straight binary number (the
format of an unsigned integer). The _f variant should not be used, since it would take
two length specifiers and treat the field’s contents as a single floating-point number.


Creating a Parallel Structure 

A program can define a front-end structure and then mimic its memory layout within a
CM field. The correspondence between the two layouts permits convenient data
movement between the front end and the CM by means of the instructions that read and
write to specified CM processors (see Chapter 7).

This section shows a simple case of mimicking a structure in a field, followed by some 
example macros that users can define to implement the general case. 

First, consider an arbitrary front-end structure: 

typedef struct {
    int a;      /* 32 bits */
    char b;     /* 8 bits */
} new_type;

new_type my_fe_structure;

Figure 12. A CM field subdivided to match the layout of a front-end structure

Creating the parallel analogue of this particular layout is straightforward:

CM_field_id_t my_cm_structure, subfield_a, subfield_b;

my_cm_structure = CM_allocate_stack_field( 40 );

subfield_a = my_cm_structure;   /* specify len=32 when using */

subfield_b = CM_add_offset_to_field_id( my_cm_structure, 32 );
                                /* specify len=8 when using */

More generally, a program can use macros to match a CM field to the layout of any 
specified front-end structure. Example 8 shows a possible implementation of such 
macros. These macros use sizeof to find the length in bytes of the front-end structure 
and then calculate the offsets of the structure’s members. These values are multiplied 
by 8 to convert the lengths to bits for the CM calculations. The macro parameters are: 

type the type of the front-end structure 

slotname the name of a front-end structure member 

obj the field-id of the CM field that mimics the structure 


Example 8. Creating a parallel structure: create-cm-struct.h 


#define TYPELEN(type) ( 8 * sizeof( type ))

#define ALLOCATE_TYPE(type) \
    CM_allocate_stack_field( TYPELEN( type ))

#define SLOT_OFFSET(type,slotname) \
    (8 * (int) &(( ( type * )0 )->slotname ))

#define CMREF(obj,type,slotname) \
    CM_add_offset_to_field_id( (obj), SLOT_OFFSET(type,slotname))


Thus, to mimic the layout of my_fe_structure in a CM field called my_cm_structure:

my_cm_structure = ALLOCATE_TYPE( new_type );


To reference the subfield that corresponds to member b:

CMREF( my_cm_structure, new_type, b )
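On front ends whose C compiler provides the standard offsetof macro in <stddef.h>, the offset calculation can be written without the null-pointer cast. The following is a sketch only: the macro names TYPELEN_BITS, SLOT_OFFSET_BITS, and SLOT_LEN_BITS and the function check_new_type_layout are hypothetical, and, like Example 8, the code assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int  a;
    char b;
} new_type;

/* Length of the whole structure in bits, padding included. */
#define TYPELEN_BITS(type)            (8 * sizeof(type))

/* Bit offset of a member, computed with the standard offsetof macro. */
#define SLOT_OFFSET_BITS(type, slot)  (8 * offsetof(type, slot))

/* Bit length of a single member (sizeof does not evaluate its operand). */
#define SLOT_LEN_BITS(type, slot)     (8 * sizeof(((type *)0)->slot))

/* Sanity check: member b must lie entirely inside the structure. */
void check_new_type_layout(void)
{
    assert(SLOT_OFFSET_BITS(new_type, b) + SLOT_LEN_BITS(new_type, b)
           <= TYPELEN_BITS(new_type));
}
```

Note that sizeof includes any padding the front-end compiler adds. On a machine with 32-bit int, new_type typically occupies 64 bits rather than the 40 bits allocated by hand earlier, so a field sized this way may be longer than the sum of its members.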





Chapter 4 

Context and Control 


Data parallel programming is distinguished from both serial programming and
control parallel programming by its single in-line thread of control. Each processor in the
CM processor array contains a different data point, but all execute exactly the same
instruction on their data points at the same time. Any processors for which the
instruction is not relevant are instructed to sit idle for that time. We say that the context has
narrowed to a smaller active set of (virtual) processors.


NOTE 

Context relates to subselecting elements of a data set; it does
not relate to selecting among data sets. For example, given a
program that operates on points and lattices, context
manipulation serves to subselect certain points or to subselect certain
lattices. Context in the sense of choice among data sets (the
points versus the lattices) is expressed in Paris through the
concept of vp-sets and the current vp-set (see Chapter 5).


In Paris programming, the flow of control is handled entirely on the front end under
the direction of the C code in the program. That is, C statements direct all branching,
looping, and recursive operations and all subroutine calls. These actions often
determine which Paris instructions are sent to the CM. Once the instructions are sent,
however, there is no further variation in the thread of control. The only control-related
determination made on the CM is which processors are to participate in the next
in-line instruction.


4.1 The Context Flag

Each CM (virtual) processor contains a predefined one-bit entity called the context 
flag. This flag serves as a processor mask: the processors whose context flag is set to 1 
participate in the next instruction; those whose context flag is set to 0 do not. Context is 
always determined explicitly by the user program. There are no Paris instructions that 
set the context flag as a side effect. 

Although the context flag is like a field, Paris supports no way to reference it. Instead,
Paris provides specialized instructions that operate on the context flag; all such
instructions have the element _context in their names. The commonly used instructions
for manipulating context are:

CM_set_context 

Makes all processors active (sets context flags to 1) 

CM_clear_context 

Makes all processors inactive (sets context flags to 0) 

CM_load_context source

Moves the value from a specified one-bit source field into the context flag in all
processors

CM_store_context dest

Moves the value of the context flag into a one-bit destination field in all
processors

CM_logand_context source 

Clears the context flag in all processors where the value in the one-bit source 
field is 0 

CM_logior_context source 

Sets the context flag in all processors where the value in the one-bit source field 
is 1 

CM_global_count_context 

Returns the number of active processors 

CM_global_logior_context 

Returns 0 if no processors are active or 1 if any processor is active 

For example, the following fragment adds 1 to all the odd values in field_a, leaving the
even values unchanged. To narrow the context to only those processors whose field_a
value is odd, the code calls CM_logand_context with field_a as source. Since this
instruction operates on only one bit, the effect is to check the least significant bit in
field_a and clear the context flag if that bit is 0 (that is, if the value is even).

CM_set_context();

CM_u_random_1L( field_a, LEN, 255 );

CM_logand_context( field_a );

CM_u_add_constant_2_1L( field_a, 1, LEN );
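The effect of this fragment can be traced with a small front-end simulation in which each array element stands for one processor. This is a sketch in plain C: the sim_ functions are hypothetical stand-ins for the Paris instructions of similar name, and the initial values are fixed by the caller rather than random.

```c
#include <assert.h>

#define NPROC 5   /* hypothetical number of simulated processors */

static unsigned char context_flag[NPROC];   /* 1 = processor is active */

/* Make all simulated processors active. */
void sim_set_context(void)
{
    for (int i = 0; i < NPROC; i++)
        context_flag[i] = 1;
}

/* Clear the context flag wherever the low bit of `source` is 0. */
void sim_logand_context(const unsigned *source)
{
    for (int i = 0; i < NPROC; i++)
        context_flag[i] &= (unsigned char)(source[i] & 1u);
}

/* Add a constant, but only in processors whose context flag is 1. */
void sim_u_add_constant(unsigned *field, unsigned constant)
{
    for (int i = 0; i < NPROC; i++)
        if (context_flag[i])
            field[i] += constant;
}

/* Count the active processors. */
int sim_global_count_context(void)
{
    int count = 0;

    for (int i = 0; i < NPROC; i++)
        count += context_flag[i];
    assert(count >= 0 && count <= NPROC);
    return count;
}
```

Starting from the values 7, 12, 253, 0, and 41, three processors remain active after the context is narrowed, and the add produces 8, 12, 254, 0, and 42: only the odd values change.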


Figure 13. Narrowing the active set of processors (first panel: all processors active
before the initialization instruction)


Unconditional Instructions 

Practically all Paris instructions are conditional: they execute only in those processors
whose context flag is set to 1. A few instructions are unconditional: they execute in all
processors regardless of the state of the context flag. The major categories of
unconditional instructions are:

• Instructions whose name contains the element _context. These instructions
operate on the context flag itself.

• Instructions whose name contains the element _always. These instructions
ignore the context flag.

For example, CM_s_move_always_1L is the unconditional variant of
CM_s_move_1L. The _always instructions are faster than the conditional
variants; they are useful when the program is unconcerned with preserving the
values in inactive processors.

• Certain specialized communications instructions, identified in the Paris
Reference Manual.

Another category of instructions that ignore the context flag are allocation instructions,
such as CM_allocate_heap_field. Allocation is primarily a front-end operation; it
simply earmarks certain CM memory addresses as allocated and associates them with a
front-end field-id. No action happens on the CM processors from an allocation
instruction, the context flags are not checked, and the field in question is always
allocated across all processors.


The Test Flag 

A common source of the values used to determine context is another predefined state 
bit, the test flag. 

The test flag is the destination for the Paris comparison instructions: eq, ne, gt, ge, lt,
and le. These operations are implemented for signed and unsigned integers and
floating-point numbers. Each is provided in three forms: compare two fields, compare a
field with a constant, and compare a field with zero. Within each processor, the
instruction compares the two values and returns true or false (1 or 0) to the test flag.

For example: 

CM_s_eq_2L( field_a, field_b, 32, 16 ); 

CM_f_gt_zero_1L( field_c, 23, 8 );

CM_u_le_constant_1L( field_d, 100, 32 );

As with the context flag, Paris provides no way to reference the test flag directly.
Instead, Paris provides specialized instructions with the element _test in their names.
These instructions are similar to the list of context instructions shown above.

Since the test flag is most often used as a source for loading the context flag, Paris
provides a specialized (unconditional) instruction for making this transfer:

CM_logand_context_with_test

For example: 

CM_set_context();

CM_f_gt_zero_1L( field_c, 23, 8 );

CM_logand_context_with_test();

In processors where the value in field_c is greater than zero, the comparison
instruction sets the test flag; in other processors, it clears the test flag. The logand instruction
then narrows the context by setting the context flag to 0 in all processors where the test
flag contains 0. The next in-line instructions will execute only in those processors where
the value in field_c is greater than zero.


4.2 Conditional Constructs 

C provides one form of conditional operation, the if statement (with or without an
appended else clause). In C/Paris, we can distinguish three kinds of conditional
operations, which differ according to the nature of the control expression. These are:

• Branching on a front-end condition

• Branching on a front-end condition that is a reflection of CM state

• Choice of actions within the CM according to each processor's value in a
specified field.

The last operation is more properly called contextualization. This construct uses a field 
as the analogue of a control expression and manipulates CM context—activates and 
deactivates processors—according to the values in that field. 


With a Front-End Condition 

C/Paris does not extend the C if statement. The control expression provided must be
scalar, and the action is what we would expect in C:

if ( condition ) {

    /* block 1: Paris calls, with or without serial C code */

}

else {

    /* block 2: Paris calls, with or without serial C code */

}

In this paradigm, condition is a regular scalar expression on the front end. If it is
nonzero (true), then block 1 executes; if it is zero (false), then block 2 executes. In no case
do both blocks execute.


With a Scalar CM Condition 

A useful variant in C/Paris is to use a scalar CM value as the control expression. As
shown in Chapter 2, scalar CM values are returned as the result of either a global
reduction operation or of a call to CM_type_read_from_processor_1L.

For example, the overflow flag, mentioned in Chapter 3, is set as a side effect in any 
processor where an arithmetic result overflows the destination field. A program can 
check whether overflow has occurred and conditionally take action: 

CM_s_s_power_2_1L( dest, source, LEN );

if ( CM_global_logior_overflow() ) {

fprintf( stderr, "Overflow from exponentiation.\n" );
exit( 1 );

} 

else 

printf( "Exponentiation succeeded.\n" ); 


This fragment raises the value in dest to the source power, an operation that invites 
overflow in small fields. The instruction CM_global_logior_overflow returns 1 if any 
overflow flag is set. If overflow has not occurred, the global reduction value is 0 and 
only the else statement executes. 
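The pattern of a parallel operation followed by a global reduction that steers the front end can be sketched without a CM. The code below is plain C and purely illustrative: sim_u_power and sim_global_logior_overflow are hypothetical stand-ins, the fields are unsigned and 8 bits long, and the sketch clears each simulated overflow flag at the start of the operation, which is an assumption of the sketch and not a statement about Paris.

```c
#include <assert.h>

#define NPROC      4    /* hypothetical number of simulated processors */
#define FIELD_BITS 8    /* length of the simulated fields */

static unsigned char overflow_flag[NPROC];

/* Raise dest to the source power in every processor, recording an
   overflow wherever the result no longer fits in FIELD_BITS bits. */
void sim_u_power(unsigned *dest, const unsigned *source)
{
    const unsigned limit = 1u << FIELD_BITS;

    for (int i = 0; i < NPROC; i++) {
        unsigned result = 1;

        assert(dest[i] < limit);
        overflow_flag[i] = 0;
        for (unsigned k = 0; k < source[i]; k++) {
            result *= dest[i];
            if (result >= limit) {        /* too large for the field */
                overflow_flag[i] = 1;
                break;
            }
        }
        dest[i] = result % limit;         /* arbitrary value on overflow */
    }
}

/* Global reduction: 1 if any processor's overflow flag is set, else 0. */
int sim_global_logior_overflow(void)
{
    int any = 0;

    for (int i = 0; i < NPROC; i++)
        any |= overflow_flag[i];
    return any;
}
```

Squaring the values 2, 3, 5, and 10 leaves every flag clear, so the reduction returns 0. Cubing 10 overflows the 8-bit field in one processor, and that single flag is enough to make the reduction return 1.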


With a Parallel CM Condition 

Paris does not support branching on the CM. In the parallel analogue of a conditional
construct, all blocks are executed on the front end and on the CM. The program
determines which CM processors participate in any given block by manipulating context to
reflect the local (per-processor) value in a field.

For example, compare the serial and parallel versions of a construct that increments a
if a is less than 100, and decrements it otherwise.

if ( a < 100 )
    a++;
else
    a--;

Where a is a field, processors need to take different actions according to the local value
of a. All the actions are specified as Paris calls, and all are executed one at a time in
exactly the order called. The context flag is set and cleared to indicate which
processors perform which action:

CM_set_context();

CM_u_lt_constant_1L( a, 100, LEN );          /* "condition" */
CM_logand_context_with_test();

CM_u_add_constant_2_1L( a, 1, LEN );         /* "then" */

CM_invert_context();

CM_u_subtract_constant_2_1L( a, 1, LEN );    /* "else" */

Figure 14. Manipulating context to express a parallel condition (panels: context set to
increment values less than 100; context set to decrement values greater than or equal
to 100)

Notice that there is no branching in this fragment: all lines of code are executed. The
processors where a is less than 100 execute the add instruction while the others sit idle;
then, the processors where a is greater than or equal to 100 execute the subtract
instruction while the others sit idle.
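The sequence can be traced on the front end with one array element per processor. This is a sketch in plain C: every sim_ function is a hypothetical stand-in for the Paris instruction of similar name, and the simulation models only the context flag and the test flag.

```c
#include <assert.h>

#define NPROC 5   /* hypothetical number of simulated processors */

static unsigned char context_flag[NPROC];  /* 1 = processor is active */
static unsigned char test_flag[NPROC];     /* result of last comparison */

void sim_set_context(void)
{
    for (int i = 0; i < NPROC; i++)
        context_flag[i] = 1;
}

/* Conditional comparison: only active processors update the test flag. */
void sim_u_lt_constant(const unsigned *field, unsigned constant)
{
    for (int i = 0; i < NPROC; i++)
        if (context_flag[i])
            test_flag[i] = field[i] < constant;
}

/* Unconditional: deactivate processors whose test flag is 0. */
void sim_logand_context_with_test(void)
{
    for (int i = 0; i < NPROC; i++)
        context_flag[i] &= test_flag[i];
}

/* Unconditional: active processors become inactive and vice versa. */
void sim_invert_context(void)
{
    for (int i = 0; i < NPROC; i++)
        context_flag[i] = !context_flag[i];
}

/* Conditional add of a (possibly negative) constant. */
void sim_add_delta(unsigned *field, int delta)
{
    for (int i = 0; i < NPROC; i++)
        if (context_flag[i]) {
            assert(delta >= 0 || field[i] >= (unsigned)(-delta));
            field[i] = (unsigned)((int)field[i] + delta);
        }
}

/* The parallel "if a < 100 then a++ else a--", with no branching on a. */
void sim_increment_or_decrement(unsigned *a)
{
    sim_set_context();
    sim_u_lt_constant(a, 100);          /* "condition" */
    sim_logand_context_with_test();
    sim_add_delta(a, 1);                /* "then" */
    sim_invert_context();
    sim_add_delta(a, -1);               /* "else" */
}
```

Given the values 5, 99, 100, 150, and 0, the result is 6, 100, 99, 149, and 1. Every call in sim_increment_or_decrement runs exactly once; only the set of participating elements changes.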

Because contextualization does not involve branching, there is no reason why the
respective sets of active processors have to be mutually exclusive. The active sets in the
above fragment are mutually exclusive only because we used CM_invert_context,
which reflects the initial state of the field.

In contrast, consider a fragment where the second contextualization reflects the state
of the field after the first operation, rather than the initial state.

CM_set_context();
CM_u_lt_constant_1L( a, 100, LEN );          /* "condition" */
CM_logand_context_with_test();
CM_u_add_constant_2_1L( a, 1, LEN );         /* "then" */

CM_set_context();
CM_u_ge_constant_1L( a, 100, LEN );          /* "condition" */
CM_logand_context_with_test();
CM_u_subtract_constant_2_1L( a, 1, LEN );    /* "then" */

Values in field a that were originally less than 100 might become equal to 100 after the
add operation. Therefore, it is quite possible for a processor to participate in both
“branches” of the operation, as the middle processor in Figure 15 does.

Notice that the above fragment calls CM_set_context before the second comparison
instruction. The comparison instructions are conditional; if the context were not reset,
the second comparison would execute only within the narrowed context, and only the
middle processor in Figure 15 would execute the subtract instruction.

[Figure 15 panels: context set to increment values less than 100; context set to
decrement values greater than or equal to 100.]

Figure 15. Manipulating context into non-mutually-exclusive active sets
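
The effect of resetting the context before the second comparison can be traced with a small front-end simulation. This is a sketch only: sim_two_conditions and its parameters are hypothetical names chosen for illustration, and each loop stands for one Paris call applied to every processor. The reset_context argument selects whether the second comparison sees all processors (as in the fragment above) or only the context narrowed by the first comparison.

```c
#include <stddef.h>

/* Simulates two independently contextualized operations on the field a.
 * took_then[p] and took_else[p] record which operations processor p
 * performed.  If reset_context is 0, the second comparison runs only in
 * the context left over from the first one. */
void sim_two_conditions(unsigned *a, int *took_then, int *took_else,
                        size_t n, int reset_context)
{
    size_t p;

    /* Set context, compare a < 100, narrow the context. */
    for (p = 0; p < n; p++)
        took_then[p] = (a[p] < 100);

    /* First operation: add 1 where the context is set. */
    for (p = 0; p < n; p++)
        if (took_then[p])
            a[p] += 1;

    /* Second comparison, a >= 100, evaluated on the updated values. */
    for (p = 0; p < n; p++) {
        int active = reset_context ? 1 : took_then[p];
        took_else[p] = active && (a[p] >= 100);
    }

    /* Second operation: subtract 1 where the new context is set. */
    for (p = 0; p < n; p++)
        if (took_else[p])
            a[p] -= 1;
}
```

With the starting values 5, 99, and 150 and the context reset, the middle processor performs both operations (99 becomes 100 and then 99 again) and the last processor performs only the subtraction. Without the reset, only the middle processor subtracts and the value 150 is left unchanged.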


With Nesting and Returns 

Finally, the explicit manipulation of context permits constructs that resemble nested if 
statements and returns to previous “levels” or contexts: 

• Nested conditionals are expressed by progressively narrowing the context. 

• Returns are expressed by saving the context at a given point and later restoring 
the saved context. 

For example, the following fragment begins by narrowing the context. This code shows 
the methods for further narrowing and for inversion within the narrowed context (the 
“nested” conditional). It also illustrates returning to a previous context after the last 
operation.

Example 9. Simulating nested conditionals and returns: nesting-and-returns.c.fragment 


/* Narrow the initial context in some way */ 

CM_load_context( source ); 

/* Save the current context */ 

CM_store_context( saved_context ); 

/* The condition */ 

/* any one-bit field or flag, such as the test flag 
after a comparison instruction */ 

/* Deactivate processors where the condition is false */
CM_logand_context_with_test();

/* Block 1: the "then" block */

/* various Paris and front-end operations */

/* Invert the context but do not activate any processors that
were not active at the time of the condition */

CM_invert_context();

CM_logand_context( saved_context );

/* Block 2: the "else" block */

/* various Paris and front-end operations */ 

/* Restore context as it was at the time of the condition */ 
CM_load_context( saved_context ); 
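
The save, narrow, invert, mask, and restore sequence of Example 9 can be followed step by step in plain C. The sketch below is an illustration under assumptions: sim_nested_if_else, its arrays, and the "add 10" and "subtract 10" block bodies are hypothetical, and the one-bit condition is passed in as an array rather than read from a test flag.

```c
#include <stddef.h>

/* context[] holds the already-narrowed outer context on entry.
 * cond[] is the one-bit condition for each processor.
 * saved[] is caller-provided scratch space of the same length.
 * Active processors with cond set add 10 ("then"); active processors
 * with cond clear subtract 10 ("else"); processors that were inactive
 * on entry are never touched. */
void sim_nested_if_else(int *value, int *context, const int *cond,
                        int *saved, size_t n)
{
    size_t p;

    for (p = 0; p < n; p++)          /* save the current context */
        saved[p] = context[p];

    for (p = 0; p < n; p++)          /* narrow: context AND condition */
        context[p] = context[p] && cond[p];

    for (p = 0; p < n; p++)          /* Block 1: the "then" block */
        if (context[p])
            value[p] += 10;

    for (p = 0; p < n; p++)          /* invert the context ... */
        context[p] = !context[p];

    for (p = 0; p < n; p++)          /* ... but stay inside the saved context */
        context[p] = context[p] && saved[p];

    for (p = 0; p < n; p++)          /* Block 2: the "else" block */
        if (context[p])
            value[p] -= 10;

    for (p = 0; p < n; p++)          /* restore the saved context */
        context[p] = saved[p];
}
```

The masking step after the inversion is what keeps the "else" block from running in processors that were inactive before the condition was evaluated; omitting it would activate them.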


4.3 Iterative Constructs 

The data parallel model provides several constructs that are analogous to serial
looping operations. These are:

• “Iteration” over data points 

• Iteration with a front-end termination condition 

• Iteration with a CM termination condition 

The behavior of the C while, do, and for statements is not extended in any way, but the
embedded statements can be Paris calls and the control expression can be a reflection
of CM state.


Iteration and Parallelism 

“Iteration” over data points, when performed on CM data, requires no special control 
constructs: it is the essence of data parallel programming. For example, consider a 
trivial C program that iterates several times over array elements: 


Example 10. An iterative C program: add-array-elements.c 


#include <stdio.h>

#define ARRAY_SIZE 16384 

main() 

{ 

int a[ ARRAY_SIZE ], b[ ARRAY_SIZE ], sum[ ARRAY_SIZE ]; 
int i, agg_sum; 

for( agg_sum = 0, i=0; i<ARRAY_SIZE; i++ ){ 
a[i] = 2; 
b[i] = 3; 

sum[i] = a[i] + b[i];
agg_sum += sum[i]; 

} 

printf( "\nThe sum of all the sums is %d.\n", agg_sum ); 

} 


This program is the functional equivalent of the simple C/Paris program
add-parallel-constants.c shown in Chapter 2. The operations that are iterative in C
(initializing the arrays, summing the elements, and computing the aggregate sum) are
expressed in Paris as parallel operations:

CM_s_move_constant_1L( field_a, 2, LEN );
CM_s_move_constant_1L( field_b, 3, LEN );
CM_s_add_3_1L( field_sum, field_a, field_b, LEN );
agg_sum = CM_global_s_add_1L( field_sum, LEN );


Iteration with Scalar Termination 

Iteration can also be performed within CM processors. In one construct, the loop
continues in every processor until a front-end control expression is false. The paradigm is
straightforward:

while (front-end expression) { 

/* Various Paris calls with or without serial C code */ 

} 

For example, the following program loops over subfields in a field as if they were a 
per-processor array. It creates a subfield and initializes it, and then repeats this action 
in every processor until the front-end condition i<SUBFIELD_COUNT is false. The value 
moved in is incremented with each iteration: the first subfield gets 0, the second gets 1, 
and so on. In this example, five subfields are created in each processor, as shown in 
Figure 16. 



Figure 16. Results of an iterative subdivide-and-initialize operation on a field 

Example 11. Per-processor iteration with front-end termination: seed-local-arrays.c


#include <cm/paris.h>
#include <stdio.h>

#define SUBFIELD_COUNT 5
#define SIZE 16

main()

{

CM_field_id_t an_array, subfield[ SUBFIELD_COUNT ];
int i;

CM_init();

CM_set_context();

an_array = CM_allocate_heap_field( SIZE * SUBFIELD_COUNT ); 

for ( i=0; i<SUBFIELD_COUNT; i++ ) {
    subfield[i] = CM_add_offset_to_field_id( an_array, i * SIZE );
    /* i is a front-end constant, so the constant form of move is used */
    CM_s_move_constant_1L( subfield[i], i, SIZE );
}

/* other code */ 

printf( "\nProgram execution completed.\n" ); 

} 


Notice that because termination is determined by a front-end constant and the CM 
context does not change, all the CM processors perform the body of the for construct 
the same number of times. 


Iteration with Parallel Termination 

When the termination of an iterative construct depends on a CM value in each
processor, the processors do not necessarily perform the action the same number of times.
The control expression must be a scalar value, but it can be a reflection of CM state.
That is, the control expression can be the result of either a global reduction
operation or a call to CM_type_read_from_processor_1L. A paradigm is:

while ( global-reduction-result ) {

/* Various Paris calls with or without serial C code */ 

} 

With this construct, it is possible to have CM processors repeat some action until a
local condition is met. As each processor reaches the termination condition, that
processor is deactivated but the others continue. The front end repeatedly checks CM
context and continues the front-end loop until no CM processors are left active.

For example, the following procedure takes a field of integers and its length and
calculates an integer base-2 logarithm of each value in the field. The action is to divide by 2
repeatedly until the quotient is less than 1. Because the result is incremented once for
every division, the count for a nonzero value is one more than the floor of log2 of the
original value (a value of 1, for instance, requires one division). When the value in a
processor becomes less than 1 (or, since we are working here with integers, equal to 0),
that processor becomes inactive. The front end calls CM_global_logior_context before
each iteration, and terminates the loop when this instruction returns 0.


Example 12. Per-processor iteration with CM termination: log2-of-int.c 


#include <cm/paris.h>

log2_of_int( source, result, s_len, r_len )
CM_field_id_t source, result;
unsigned int s_len, r_len;
{
    CM_field_id_t temp_source, saved_context;

    temp_source = CM_allocate_stack_field( s_len );
    saved_context = CM_allocate_stack_field( 1 );

    CM_store_context( saved_context );

    CM_s_move_1L( temp_source, source, s_len );
    CM_s_move_zero_1L( result, r_len );

    CM_s_ne_zero_1L( temp_source, s_len );
    CM_logand_context_with_test();

    while( CM_global_logior_context() ){
        CM_s_truncate_constant_2_1L( temp_source, 2, s_len );
        CM_s_add_constant_2_1L( result, 1, r_len );
        CM_s_ne_zero_1L( temp_source, s_len );
        CM_logand_context_with_test();
    }

    CM_load_context( saved_context );
    CM_deallocate_stack_through( temp_source );
}

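
The control flow of Example 12 can be checked on the front end alone. The following sketch mirrors the example call for call in ordinary C; sim_log2_of_int and any_active are hypothetical names, the arrays stand in for CM fields and the context flag, and the function returns the number of front-end loop iterations so that the effect of the global OR can be observed.

```c
#include <stdlib.h>

/* Stand-in for CM_global_logior_context: 1 if any processor is active. */
static int any_active(const int *context, size_t n)
{
    size_t p;
    for (p = 0; p < n; p++)
        if (context[p])
            return 1;
    return 0;
}

/* Mirrors Example 12.  Returns the number of front-end loop iterations,
 * or -1 if scratch memory cannot be allocated.  Processors that are
 * inactive on entry are left untouched, and the context is restored. */
int sim_log2_of_int(const unsigned *source, unsigned *result,
                    int *context, size_t n)
{
    unsigned *temp;
    int *saved;
    int iterations = 0;
    size_t p;

    if (n == 0)
        return 0;
    temp = calloc(n, sizeof *temp);
    saved = malloc(n * sizeof *saved);
    if (temp == NULL || saved == NULL) {
        free(temp);
        free(saved);
        return -1;
    }

    for (p = 0; p < n; p++)              /* store the context */
        saved[p] = context[p];
    for (p = 0; p < n; p++)              /* conditional moves */
        if (context[p]) {
            temp[p] = source[p];
            result[p] = 0;
        }
    for (p = 0; p < n; p++)              /* test for nonzero, narrow context */
        context[p] = context[p] && (temp[p] != 0);

    while (any_active(context, n)) {
        for (p = 0; p < n; p++)
            if (context[p]) {
                temp[p] /= 2;            /* truncating divide by 2 */
                result[p] += 1;
            }
        for (p = 0; p < n; p++)          /* drop processors that reached 0 */
            context[p] = context[p] && (temp[p] != 0);
        iterations++;
    }

    for (p = 0; p < n; p++)              /* load the saved context */
        context[p] = saved[p];
    free(temp);
    free(saved);
    return iterations;
}
```

For the source values 0, 1, 2, 8, and 1000 the results are 0, 1, 2, 4, and 10, and the front-end loop runs 10 times, as many times as the largest value requires. Counting one increment per division in this way gives, for each nonzero value, one more than the floor of its base-2 logarithm.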





Chapter 5 

Configuring Virtual Processors 


Paris presents to the user an abstract machine architecture that is, not surprisingly,
very much like the physical Connection Machine architecture. The one major
extension is virtual processing, which permits a program to specify nearly any number of
processors when it allocates data on the CM.

The virtual processing mechanism enables each CM physical processor to simulate 
some specified number of virtual processors (VPs). The processor’s memory is shared 
among several VPs, and the physical processor is automatically time-sliced among the 
data that pertains to all the virtual processors that it is simulating. For instance, with a 
virtual processor ratio of 4, four instances of each memory field are allocated in each 
physical processor’s memory and the physical processor performs each instruction 
four times in sequence. 

The mapping of virtual processors onto physical processors is transparent to the user.
Paris encourages programmers to think entirely in terms of virtual processors, both for
memory allocation and for computation. In this view, the virtual machine appears as
an n-dimensional array of processors whose size and shape is under program control.


5.1 Why Virtual Processors? 

Given that the data parallel programming model associates a processor with each data 
point, virtual processing greatly increases the expressive power of Paris: 

• Large data sets. Although the CM system provides up to 65,536 processors, it is
not unusual for data sets to have hundreds of thousands or even millions of
elements. The virtual processor abstraction permits the number of CM
processors to be logically increased to that needed for any size data set.

• Scalability. Since the processors devoted to a problem are virtual rather than
physical, a data parallel program is not tied to any given machine size. The
virtual processor mechanism automatically provides the number of “processors”
called for, which allows a program to run unchanged on any size CM
system.

• Natural data layout. The virtual processor mechanism simulates the shape, as
well as the size, of a data set. For instance, graphics applications are usually
laid out as a 2-dimensional grid of pixels, while modeling the diffusion of heat
through a metal block requires a 3-dimensional layout of data points.
Programs can configure virtual processors into n-dimensional grids and specify
the length of each dimension, thus reflecting the natural shape of
n-dimensional data points.

• Optimized communications. Each virtual processor has a unique address,
which allows any other processor (including the front end) to transfer data to
it. In addition, specialized Paris instructions rely on the logical shape of a
virtual processor grid to perform high-speed nearest-neighbor communications
and cumulative computations along any of the axes of the grid.

• Multiple data sets. Many problems, of course, have several data sets of different
sizes and shapes. The virtual processor mechanism is dynamic: it allows
different virtual configurations to coexist, and it permits a program to create and
destroy virtual configurations as needed at run time.


5.2 Overview of Configuration Procedure 

A particular configuration of virtual processors is called a vp-set. Each vp-set has a
geometry, which specifies the number of virtual processors in the vp-set and their
logical organization in n-space. When CM memory is allocated, the field is associated with
exactly one vp-set, and the field shares the geometry (the size and shape) of its vp-set.

For example, consider an arbitrary CM field, field_a. The many flavors of field_a
shown previously in this manual have all been allocated within Paris's default vp-set.
When a program does not explicitly create a vp-set, the system creates one with a
geometry that is 2-dimensional (as nearly square as possible) and the same size as the
physical machine. Thus, field_a in all previous illustrations is 2-dimensional, and its
size is determined at run time. The virtual processor ratio (VPR) of the default vp-set is
of course 1.

Alternatively, field_a could be allocated in an explicitly defined vp-set of nearly any
size and shape. If the size is twice that of the physical machine, then the VPR of
field_a’s vp-set is 2. If allocated in a vp-set with a multidimensional geometry, field_a is
multidimensional and the VPR is the product of the dimension sizes divided by the size
of the physical machine.

The following fragment shows the essential procedure for creating a vp-set and
associating memory with it. The remainder of this chapter elaborates on these four steps:

1. Create a geometry, using CM_create_geometry 

2. Create a vp-set associated with that geometry, using CM_allocate_vp_set 

3. Make the vp-set the current vp-set, using CM_set_vp_set 

4. Allocate CM memory in the current vp-set, using any of the field allocation 
instructions 


Example 13. Creating a vp-set: create-vp-1dim.c.fragment


int               dimensions[1];
CM_geometry_id_t  geometry;
CM_vp_set_id_t    vp_set;
CM_field_id_t     field;

CM_init();

dimensions[0] = 16384;
geometry = CM_create_geometry( dimensions, 1 );
vp_set = CM_allocate_vp_set( geometry );
CM_set_vp_set( vp_set );
field = CM_allocate_heap_field( 32 );
CM_set_context();

/* various operations on the field */


5.3 Creating a Geometry

A geometry is a front-end object that describes an n-dimensional grid of elements. 
When later associated with a vp-set, the geometry defines the configuration of the 
processors in that vp-set. 


Procedure 

The following instruction creates a geometry and returns on the front end a
geometry-id, which the program can (optionally) assign to a variable of type CM_geometry_id_t:

CM_create_geometry dimension-array rank 

The dimension-array operand is a C array whose element values are the lengths of the 
axes of the geometry. In the example above, this argument is an array of one element, 
initialized as 16384 (16K): 

int dimensions[1]; 
dimensions[0] = 16384; 

The second argument to CM_create_geometry is a rank, an unsigned integer that
specifies the number of dimensions of the geometry. This value can be any integer
from 1 to 31, inclusive. In this example, the rank is 1 and the call that creates the
geometry is:

CM_geometry_id_t geometry; 

geometry = CM_create_geometry( dimensions, 1 ); 


Restrictions 

The current restrictions on defining geometries are: 

• The length of each axis must be a power of 2. Their product (the total number
of virtual processors) is therefore a power of 2.

• The product of the axis lengths must be an integer multiple of the physical size
of the CM system or section that executes the program. This integer is the
virtual processor ratio, which varies according to physical machine size.

• It follows that the VPR must also be a power of 2.

Future versions of Paris may remove the restriction that VPRs must be a power of 2,
but the restriction that vp-set sizes must be integer multiples of physical machine size
is likely to remain.
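
These restrictions can be checked on the front end before a geometry is created. The sketch below is a plain C helper written for illustration; sim_geometry_vpr is a hypothetical name, not a Paris instruction, and it assumes the physical machine size is passed in by the caller. It returns the VPR that a dimension array would produce, or 0 if the array violates one of the restrictions listed above.

```c
#include <limits.h>
#include <stddef.h>

static int is_power_of_two(unsigned long x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Returns the virtual processor ratio for the given axis lengths on a
 * machine (or section) of `physical` processors, or 0 if the geometry
 * is not legal: rank outside 1..31, an axis length that is not a power
 * of 2, or a total size that is not a multiple of the machine size. */
unsigned long sim_geometry_vpr(const unsigned long *dims, size_t rank,
                               unsigned long physical)
{
    unsigned long total = 1;
    size_t i;

    if (rank < 1 || rank > 31 || physical == 0)
        return 0;

    for (i = 0; i < rank; i++) {
        if (!is_power_of_two(dims[i]))
            return 0;
        if (total > ULONG_MAX / dims[i])   /* product would overflow */
            return 0;
        total *= dims[i];
    }

    if (total % physical != 0)             /* includes total < physical */
        return 0;
    return total / physical;
}
```

For the dimension array { 8192, 4 }, the helper returns 1 for a 32K machine, 2 for a 16K machine, and 0 for a 64K machine, where the VPR would be less than 1.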


Examples 

For example, the following geometry definitions are all legal. 

• A 2-dimensional geometry of total size 32,768: 

CM_geometry_id_t geometry_2D;
int dim[2] = { 8192, 4 }; 

geometry_2D = CM_create_geometry( dim, 2 ); 

If a program containing this geometry executes on a 32K CM, the VPR is 1; on
a 16K system, the VPR is 2. This program cannot execute on 64K physical
processors because the VPR would be less than 1, violating the second
restriction.


• A 3-dimensional geometry of total size 16,384: 

CM_geometry_id_t geometry_3D; 
int dim[3]; 

dim[0] = 16; 
dim[1] = 512; 
dim[2] = 2; 

geometry_3D = CM_create_geometry( dim, 3 ); 

This program can execute only on a 16K or 8K set of CM processors. 

• A 1-dimensional geometry that is set to current machine size: 

CM_geometry_id_t geometry_1D;
int dim[1];

dim[0] = CM_physical_processors_limit;
geometry_1D = CM_create_geometry( dim, 1 );


This program can execute on any size CM system or section.


Retrieving Attributes

Paris provides a number of instructions that inquire about the size and shape of a 
geometry and the attributes of its axes. All such instructions take a geometry-id, and 
some take an integer that identifies the axis of interest. The integer is the appropriate 
subscript of the dimension-array that was used to define the geometry. 

Examples of the inquiry instructions are: 

CM_geometry_total_processors geometry-id 
CM_geometry_total_vp_ratio geometry-id 
CM_geometry_rank geometry-id 
CM_geometry_axis_length geometry-id axis
CM_geometry_axis_vp_ratio geometry-id axis 

The complete list of inquiry instructions appears in the Paris Reference Manual. Since
these instructions pertain to a front-end object (the geometry), they are all
unconditional.


Optimization 

The rank and dimension sizes define a geometry sufficiently for the system to provide a
correct mapping of virtual to physical processors. Many such mappings are possible,
and all are equally efficient for programs that involve little or no communication
between logical neighbors on a grid axis.

In programs that do perform communications between nearest neighbors or
cumulative computations along grid axes, the programmer might wish to specify further
properties of a geometry. These properties include:

• The ordering of the axes, which influences the particular embedding of the
logical grid into the physical grid

• The weight of the axes, which influences whether the virtual processors on an
axis are laid out within, rather than across, physical processors or laid out
across processors that are all located on the same CM chip

The system optimizes interprocessor communication along certain axes of the
geometry according to their weight and ordering properties. To specify these properties,
create the geometry by calling CM_create_detailed_geometry, which is described in the
Paris Reference Manual and illustrated in Appendix G, “Drawing Lines.”





5.4 Creating a Vp-Set 


Procedure 

Once a geometry is defined, the program can use that geometry to create one or more 
vp-sets. 

CM_allocate_vp_set geometry-id 

This instruction takes a geometry-id and creates a vp-set of the size and shape
described by the geometry. The instruction returns a vp-set-id of type CM_vp_set_id_t.
As shown in Example 13 above:

CM_vp_set_id_t vp_set;

/* ... */

geometry = CM_create_geometry( dimensions, 1 ); 
vp_set = CM_allocate_vp_set( geometry ); 


Changing Shape 

The size of a vp-set is fixed at the time of the vp-set’s creation. Its shape, however, can be
changed at any time by associating it with a different geometry:

CM_set_vp_set_geometry vp-set-id geometry-id

For example, a program that operates on fields in a 3-dimensional configuration might
need to change the shape temporarily to 1-dimensional, perhaps to permit a
cumulative operation across all the processors. (Cumulative computations are performed
along a grid axis, as shown in Chapter 6.) The virtual processors are therefore
reconfigured into a different logical organization, although their total number does not
change:

int first_dim_array[3] = { 512, 16, 2 };
int second_dim_array[1] = { 16384 };

/* ... */

geometry_3D = CM_create_geometry( first_dim_array, 3 );
my_vp_set = CM_allocate_vp_set( geometry_3D );

/* various operations in 3 dimensions */

geometry_1D = CM_create_geometry( second_dim_array, 1 );
CM_set_vp_set_geometry( my_vp_set, geometry_1D );

/* operations on the same data points in 1 dimension */ 

This fragment reconfigures a 3-dimensional grid of processors into a 1-dimensional
grid of the same total size. No data actually moves, but the logical layout of data points
is changed. Be aware that the mapping of processors between the two grids is not what
one might expect from similar operations on serial computers: specifically, the layout
of processors is neither row-major nor column-major. Paris programs should not
depend on any particular mapping of processors from one grid to another.

The geometry-id currently associated with any vp-set can be retrieved by executing: 
CM_vp_set_geometry vp-set-id 


Deallocating Geometries 

Associating a vp-set with a new geometry implicitly destroys its association with its 
previous geometry. If the previous geometry will not be used again, the program can 
free up system resources by deallocating it: 

CM_deallocate_geometry geometry-id 

It is an error to deallocate a geometry that is still associated with some vp-set. 


Deallocating Vp-Sets 

A program can also deallocate vp-sets that are no longer needed, provided that they no 
longer have storage allocated within them. Deallocating a vp-set does not affect its 
associated geometry. 

CM_deallocate_vp_set vp-set-id 


It is an error to deallocate a vp-set that still has memory fields associated with it. 


5.5 Setting the Current Vp-Set

A program can create any arbitrary number of vp-sets, but only one vp-set is active at a
time. This vp-set, known as the current vp-set, is the only vp-set in which Paris
instructions can execute. All other vp-sets are latent: they cannot execute instructions.

Certain interprocessor communication instructions can operate across vp-sets, as
described in Part III of this manual. However, only the VPs in the current vp-set perform
the action of the instruction (sending or getting messages). The VPs in the other,
non-current, vp-set are simply the passive destination or source of the messages.


Procedure 

To make a vp-set current, use: 

CM_set_vp_set vp-set-id 
As shown in Example 13: 


CM_set_vp_set( vp_set ); 

The default vp-set coexists with any vp-sets that the program creates. Until this
instruction is executed, the default vp-set is the current vp-set. After this instruction is
executed, all Paris instructions operate only within the argument vp-set until such time
as the current vp-set is changed again.


Retrieval 

The ID of the current vp-set is always available as the value of the Paris variable
CM_current_vp_set. For example, to determine the number of processors in the
current vp-set, a program could call:

int n; 

n = CM_geometry_total_processors 

( CM_vp_set_geometry( CM_current_vp_set )); 

If the program has performed any operations within the default vp-set, it is wise to give 
this vp-set an identifier before making a user-defined vp-set current. Without such an 
identifier, the default vp-set cannot be referenced or made current again. 









For example: 

CM_vp_set_id_t apples, oranges; 

/* various operations within the default vp-set */ 

apples = CM_current_vp_set; 

CM_set_vp_set( oranges );

/* various operations within the user-defined vp-set */
CM_set_vp_set( apples );

/* various operations within the original vp-set */ 


5.6 Allocating Memory 

At the time of its creation, a vp-set has the full complement of flags (context, test,
overflow, and carry) but no associated memory. The program uses the storage allocation
instructions to allocate memory fields in vp-sets. Each field is allocated in all virtual
processors in the vp-set.


Procedure 

The instructions introduced earlier allocate fields in the current vp-set: 

CM_allocate_heap_field length 
CM_allocate_stack_field length 

Programs can also allocate fields in a vp-set that is not necessarily current: 

CM_allocate_heap_field_vp_set vp-set-id length 
CM_allocate_stack_field_vp_set vp-set-id length 

To determine which vp-set a field is associated with, use the following instruction 
(which returns a vp-set-id): 


Memory Layout

Normally, it is useful to think of each virtual processor as having its own allocated
memory and to picture virtual processors’ memories as separate, per-processor heaps
and stacks. The diagrams shown in Chapter 3 in the discussion of CM storage
management are intended to be abstractions of virtual, not physical, processors and their
memories.

However, a brief digression into physical layout is useful for clarifying the restrictions
on field deallocation and for predicting the efficiency of Paris programs.

The determining factor in physical layout is the VPR, which is the number of virtual
processors that a physical processor is simulating for each vp-set. In the mapping of a
set of virtual processors onto physical processors, all physical processors are used and
the virtual processors are “spread out” as much as possible across the physical
processors. This amounts to saying that the VPR is never less than 1 and that it is kept as low
as possible.

The VPR of each vp-set is determined at run time by the number of physical processors
available to the program in relation to the size of the vp-set. For example, imagine that
the ubiquitous field_a is allocated in a vp-set of size 32,768. Figure 17 shows the
physical layout of this field when the program executes on 32K processors (at left) and on
16K processors (at right).


Figure 17. Physical layout of a field in a 32K vp-set


Notice that the shape of the vp-set is irrelevant to its physical layout in memory. Shape
determines which VPs are considered nearest neighbors along a logical axis, but VPR
alone determines the number of “banks” required in each physical processor to
accommodate the fields of a vp-set. The field shown in Figure 17 could be of any logical
dimensionality, regardless of how many memory banks its VPR requires.

When multiple vp-sets exist, each physical processor’s memory contains fields from
every vp-set, and each field is replicated the number of times that its vp-set’s VPR
requires.
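
Because each field is replicated once per simulated virtual processor, the memory that a set of fields occupies in one physical processor can be estimated as the sum of VPR times field length over all fields. The following sketch shows that arithmetic; the sim_field type and the function name are hypothetical, and the estimate deliberately ignores any system overhead, which this manual does not describe.

```c
#include <stddef.h>

/* One allocated field: the VPR of its vp-set and its length in bits. */
typedef struct {
    unsigned long vpr;
    unsigned long length_bits;
} sim_field;

/* Bits of one physical processor's memory occupied by the given fields:
 * each field appears once for every virtual processor being simulated. */
unsigned long sim_bits_per_physical_processor(const sim_field *fields,
                                              size_t count)
{
    unsigned long bits = 0;
    size_t i;

    for (i = 0; i < count; i++)
        bits += fields[i].vpr * fields[i].length_bits;
    return bits;
}
```

For instance, a 32-bit field in a vp-set with a VPR of 1, a 16-bit field in a vp-set with a VPR of 4, and an 8-bit field in a vp-set with a VPR of 2 together occupy 32 + 64 + 16 = 112 bits in each physical processor.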

The VPs that are co-resident on a physical processor share a single heap and a single 
stack, and their fields are intermingled in more-or-less the order allocated. That is, 
stack fields are stored in exactly the order allocated, regardless of vp-set, whereas heap 
fields may depart from the allocated order to the extent that field deallocation has 
freed up space that was previously used, again regardless of vp-set. 

For example, consider a program that has three vp-sets: apples, oranges, and pears
(shown below as Example 14). For each of the three vp-sets, the total number of virtual
processors is the product of its dimension sizes (axis lengths), and its VPR is the
number of virtual processors divided by the physical machine size. Thus, when executing
on 16,384 (16K) CM processors:

apples:    size = 128x128    = 16,384 (16K)    VPR = 16K/16K = 1
oranges:   size =              65,536 (64K)    VPR = 64K/16K = 4
pears:     size = 64x16x4x8  = 32,768 (32K)    VPR = 32K/16K = 2

The physical memory layout of the fields in the three vp-sets is shown in Figure 18. In
this figure, each physical processor is simulating 7 virtual processors: 1 processor in
vp-set apples (4 fields allocated), 4 processors in vp-set oranges (2 fields allocated),
and 2 processors in vp-set pears (3 fields allocated). The order of the fields reflects the
order of their allocation, certainly within the physical stack and probably within the
physical heap.

The ranks and dimension sizes shown above and in Example 14 are arbitrary—what 
determines physical layout is the total number of processors and thus the VPR. For 
example, vp-set apples could as well have 1 dimension of size 16K or 3 dimensions of 
sizes 2, 8, and 1024. As long as the product of the dimension sizes is 16K, the virtual 
processors in apples will be laid out one-per-physical-processor on a 16K CM system 
or section. 

[Figure 18 shows the memory of several physical processors, with a legend for apples
(VPR=1), oranges (VPR=4), pears (VPR=2), and unallocated memory. The heap holds
apples_a, oranges_a, pears_a, and pears_b; the stack holds apples_f, apples_g,
oranges_f, pears_f, and apples_h.]

Figure 18. Physical layout of 9 fields in 3 vp-sets


Recall from Chapter 3 that the heap fields can be deallocated in any order. However, 
the LIFO stack protocol applies to the physical stack , not to the stack fields associated 
with a given VP. That is, deallocating field apples_f causes the deallocation not only of 
apples_g and apples_h but also of oranges_f and pears_f. 


NOTE 

It is an error to access any stack field after an earlier stack field 
has been deallocated, regardless of vp-set. 
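
The consequence of sharing one physical stack among all vp-sets can be modeled with a few lines of C. This is a sketch under assumptions: the sim_stack type and the sim_ functions are hypothetical, a field is identified simply by its position on the simulated stack, and the owning vp-set is recorded only to show that deallocation ignores it.

```c
#include <stddef.h>

#define SIM_STACK_MAX 16   /* arbitrary capacity for the sketch */

typedef struct {
    int    vp_set[SIM_STACK_MAX];  /* owner of each live field, oldest first */
    size_t depth;                  /* number of live stack fields */
} sim_stack;

/* Pushes a field for the given vp-set; returns its id, or -1 if full. */
int sim_allocate_stack_field(sim_stack *s, int vp_set)
{
    int id;

    if (s->depth >= SIM_STACK_MAX)
        return -1;
    id = (int)s->depth;
    s->vp_set[s->depth] = vp_set;
    s->depth++;
    return id;
}

/* Frees `field` and every field allocated after it, whatever vp-set each
 * belongs to.  Returns the number of fields freed; if other_vp_sets is
 * not NULL, it receives how many of them belonged to a different vp-set
 * than `field` itself. */
size_t sim_deallocate_stack_through(sim_stack *s, int field,
                                    size_t *other_vp_sets)
{
    size_t first, i, freed, others = 0;

    if (field < 0 || (size_t)field >= s->depth)
        return 0;
    first = (size_t)field;
    for (i = first; i < s->depth; i++)
        if (s->vp_set[i] != s->vp_set[first])
            others++;
    freed = s->depth - first;
    s->depth = first;
    if (other_vp_sets != NULL)
        *other_vp_sets = others;
    return freed;
}
```

If the fields are pushed in the order apples_f, apples_g, oranges_f, pears_f, apples_h (an allocation order assumed here for illustration), deallocating through apples_f frees all five fields, two of which belong to vp-sets other than apples.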
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Example 14. Code underlying Figure 18: vp-sets.c.fragment 


int 


CM_geome t r y_id_t 
CM_vp_set__id_t 
CM field id t 


dim_apples_geom[2] , 
dim__oranges_geom[l], 
diinjpears_jreom[4] ; 

apples_geom, oranges_geom, pears__geom; 
apples, oranges, pears; 
apples_a, apples_f, apples__g, apples_h, 
oranges_a, oranges_f, 
pears_a, pears__b, pears_f; 


CM_init(); 

/* ====================================================:==:===== * / 

/* Create vp-set apples, make it current, and allocate memory */ 

dim__apples_geom[0] = 128; 

dim_apples__geom[l] = 128; 

apples__geom = CM_create_geometry( dim__apples_geom, 2 ); 

apples = CM_allocate__vp_set ( apples__geom ); 

CM_jset_vp__set ( apples ); 

apples_a = CM_ailocate_heap_field( 32 ); 

apples_f = CM_allocate_jstack_f ield( 8 ) ; 

apples_g = CM_allocate_stack__field( 32 ); 

CM_set__context () ; 

/* various operations on fields in vp-set apples */ 


/* ===================================:=:==============:========= * / 

/* Create vp-set oranges from within vp-set apples */ 

dim_oranges__geom[0] = 65536; 

oranges__geom = CM__create_geometry ( dim_oranges__geom, 1 ); 

oranges = CM_allocate_vp_jset ( oranges__geom ); 

/* Make vp-set oranges current and allocate memory */ 

CM__set_vp_jset ( oranges ); 

oranges_a = CM_allocate__heap_field( 16 ); 

orange_f = CM_allocate__stack_f ield( 12 ) ; 

/* various operations on fields in vp-set oranges */

/* Create vp-set pears and allocate memory in pears while still
   within vp-set oranges */

dim_pears_geom[0] = 64;
dim_pears_geom[1] = 16;
dim_pears_geom[2] = 4;
dim_pears_geom[3] = 8;
pears_geom = CM_create_geometry( dim_pears_geom, 4 );
pears = CM_allocate_vp_set( pears_geom );
pears_a = CM_allocate_heap_field_vp_set( pears, 32 );
pears_b = CM_allocate_heap_field_vp_set( pears, 8 );
pears_f = CM_allocate_stack_field_vp_set( pears, 32 );

/* more operations on fields in vp-set oranges */

/* ============================================================ */

/* Make vp-set pears current and operate on its fields */

CM_set_vp_set( pears );

/* various operations on fields in vp-set pears */

/* Make vp-set apples current and operate on its fields */

CM_set_vp_set( apples );

/* various operations on fields in vp-set apples */

/* Deallocate the earliest stack field; all stack fields in all
   vp-sets are deallocated. */

CM_deallocate_stack_through( apples_f );

Aside from the constraint on deallocating stack fields, the physical layout of CM
memory never affects program behavior (although it may affect program performance).
An understanding of physical layout does, however, help to clarify two points made
earlier in this chapter:

• The size of a vp-set is fixed, but its shape can change. The total size of a vp-set 
is reflected in its physical layout, which cannot change; its rank and individual 
dimension sizes, however, are unrelated to layout and thus can change. 

• The weight of an axis of virtual processors influences its mapping onto
physical processors. In the more fine-tuned geometries created by
CM_create_detailed_geometry, the user can specify the axis on which the most
interprocessor communication will occur. If this axis fits within the number of
memory banks required by that vp-set (that is, if VPR > dim[x], where x is the
heavily used axis), the system can lay out the VPs for that axis entirely within
a single physical processor, thus enhancing the speed of inter-virtual-processor
communications.

This discussion also makes it obvious that VPRs greater than 1 involve two penalties: 

• Memory usage increases with VPR in a linear fashion. The virtual processor
mechanism sets no limit on the size of data sets that can be handled on a
one-element-per-processor basis. However, physical memory can become a
constraint at very high VPRs.

• Execution time for most operations increases with VPR in a linear fashion. 
With a VPR of 4, for instance, each physical processor loops serially over 4 
banks of memory and thus performs 4 times as many instructions as if virtual 
processors were not in use. The MIPS and FLOPS rates, however, are about the 
same as they would be for a data set the same size as the physical machine. 

An important exception is interprocessor operations that rely on local
(same-physical-processor or same-chip) communications. The speed of these
operations at high VPRs is often sublinear; that is, faster than a linear
extrapolation would predict.


5.7 Processor Addresses 

In all the operations shown thus far, each virtual processor performs computations 
independently of other processors. However, very few useful applications decompose 
into such totally independent subproblems. Instead, processors often need to transfer 
data among themselves; and the front end, of course, often needs to access specified
processors for purposes of I/O or manipulating context.

To facilitate interprocessor communications, each virtual processor in a vp-set has 
two addresses, each of which uniquely identifies that processor within that vp-set. The 
two addresses correspond to the two general models of interprocessor communication 
in the CM system: 

• A send address: a single integer that remains constant for each virtual
processor for the life of its vp-set. Any processor (including the front end) can
access any other processor in the CM processor array by specifying the
destination processor’s send address.

Every processor can send a message to (or get a message from) another specified
processor, all at the same time. The procedures for communicating in arbitrary
patterns to specified processor addresses are described in Chapter 7 of this
manual.

• A NEWS address: a set of coordinates that reflects a virtual processor’s grid 
position in the current geometry of its vp-set. Unlike send addresses, NEWS 
addresses are dependent on the current geometry of a vp-set and will change if 
the geometry changes. 

Nearest-neighbor communications and cumulative computations along grid
axes are dependent on grid positions, although the program need specify only
the pattern of communication (not actual coordinates) to perform these
operations on all processors in parallel. The procedures for communicating in
regular patterns are described in Chapter 6 of this manual.







Part III 

Interprocessor Communications 




Chapter 6 

Communicating in Regular Patterns 


The virtue of organizing virtual processors as a logical grid is that each processor can
access the memory of another processor without computing the other’s address.
Instead, processors can communicate in regular patterns by specifying only a grid axis
and a pattern.

The possible patterns for grid communication in Paris are: 

• Nearest-neighbor communication, where each processor gets a message from 
the processor that is next to it on a grid axis. 

• Remote-neighbor communication, where each processor gets a message from
the processor that lies at some specified distance and in some specified
direction in the grid.

• Block transfers of data between a front-end array and the CM processors that 
make up a grid of comparable size and shape. 

• Cumulative, or parallel prefix, computation, where some combining operation 
(such as addition) is performed cumulatively across all processors on a grid 
axis in a specified direction. 

Grid communications are sometimes called NEWS operations. The term NEWS has
historical significance in designating the four nearest neighbors to any processor on a
2-dimensional grid: North, East, West, and South. Paris now supports grids from 1
dimension to 31 dimensions; the number of nearest neighbors to any given processor is
2n, where n is the rank (number of dimensions) of the current geometry.

Although grid communication is only a subset of the communications that are possible 
on the CM system, it is an important subset that is optimized for speed. Specifically: 

• A processor need not calculate or check the address of the processor whose 
memory it is to access. 

• There is no possibility of collisions, where more than one message arrives at a
processor from a single operation, and thus no need to combine incoming
messages.

• Paris guarantees that virtual processors that are nearest neighbors on a logical 
grid are also nearest neighbors on the physical grid. 

Paris performs virtual-to-physical mapping such that any two nearest-neighbor
virtual processors are located either within the memory of a single physical
processor or in physical processors on the same chip, or they are linked
directly by a single wire. Thus, the CM hardware directly supports grid
communications, no matter what the shape of the logical grid.

All communication along grid axes (including nearest-neighbor, remote-neighbor,
and cumulative communications) necessarily occurs within the current vp-set. The
only grid operation that involves more than one vp-set is CM_cross_vp_move_1L,
which copies data from a grid in the current vp-set to a grid in another vp-set. See the
Paris Dictionary Supplement, Version 5.1, for information.


6.1 Grid Coordinates 

Each virtual processor has a set of coordinates that define its position in the grid
described by the current geometry of its vp-set. The n-tuple of the grid coordinates for
each processor is its grid address or NEWS address, by which another processor
(including the front end) can identify that processor.

The coordinates are specific to the geometry; changing the geometry of the vp-set 
changes the grid coordinates of the individual virtual processors. (See Chapter 7 for 
information on determining the new NEWS address of a given processor after a change 
of geometry.) 


Numbering Axes and Processors 

The axes of a grid and the virtual processors on each axis are numbered as one would 
expect in a Cartesian coordinate system: both the axes and the processors on each axis 
are numbered with sequential (contiguous) unsigned integers beginning with zero. For 
example, the numbering of axes and processors that results from the following vp-set 
definition is shown in Figure 19. 

unsigned int dim[3];

dim[0] = n;
dim[1] = m;
dim[2] = 2;

/* identifier renamed from 3D_geom: a C identifier cannot begin with a digit */
geom_3d = CM_create_geometry( dim, 3 );
my_vp_set = CM_allocate_vp_set( geom_3d );

Recall that the product of n, m, and 2 must be a power-of-2 multiple of the physical 
machine size. 
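
The size rule can be checked mechanically. The following is a minimal sketch in plain C (it is not a Paris call, and the function name is_power_of_two_multiple is our own) that tests whether a proposed vp-set size is a power-of-2 multiple of a given physical machine size:

```c
/* Hypothetical helper in plain C, not part of Paris.
   Returns 1 if total equals machine_size times a power of 2
   (including 2 to the power 0, that is, a VP ratio of 1), else 0. */
int is_power_of_two_multiple(unsigned long total, unsigned long machine_size)
{
    unsigned long ratio;

    if (machine_size == 0 || total == 0 || total % machine_size != 0)
        return 0;

    ratio = total / machine_size;          /* the resulting VP ratio */
    return (ratio & (ratio - 1)) == 0;     /* true only for powers of 2 */
}
```

For instance, a 256 x 128 x 2 grid has 65536 virtual processors, which is 8 times a machine of 8192 processors and therefore passes the check; a grid of 3 x 8192 processors would not.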


[Figure 19 shows the virtual processors of the grid, each labeled with its
coordinates on axes 0, 1, and 2, from 0,0,0 through n,m,1.]

Figure 19. Grid coordinates and axis numbers in a 3-D geometry


Retrieving Grid Coordinates 

Paris provides an instruction by which each virtual processor can determine its own
coordinate on a specified axis of the current geometry. The instruction places the
coordinate value for each processor in a destination field in the same processor.

CM_my_news_coordinate_1L dest axis dest-length 
This instruction is conditional and operates only within the current vp-set. 

In most cases, it is safe to specify dest-length as, say, 16 or 32 bits. To determine the
minimum length needed for the destination field, the program needs to compute the 
number of bits required to represent the highest coordinate value on a specified grid 
axis. For this purpose, we use: 

CM_geometry_coordinate_length geometry-id axis 
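
The quantity in question is the number of bits needed to hold the highest coordinate on the axis, which is the axis length minus 1. As an illustration only, the following plain C sketch performs that calculation on the front end; the name coordinate_bits is hypothetical, and we assume (the text above does not say) that an axis of length 1 needs 0 bits:

```c
/* Hypothetical helper in plain C, not part of Paris: the number of bits
   needed to represent the highest coordinate (axis_length - 1). */
unsigned int coordinate_bits(unsigned int axis_length)
{
    unsigned int highest = axis_length > 0 ? axis_length - 1 : 0;
    unsigned int bits = 0;

    while (highest > 0) {      /* count the significant bits */
        bits++;
        highest >>= 1;
    }
    return bits;
}
```

Under this reading, an axis of length 128 has a highest coordinate of 127 and needs 7 bits, so a 16-bit destination field is more than sufficient.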

Example 15 uses these instructions to compute a 2-dimensional identity matrix. Each 
processor computes its own coordinates on the two axes, placing the values in fields x 
and y, respectively, as shown in Figure 20. 

Figure 20. Grid coordinates and identity field values in a 2-dimensional geometry


The code also creates a field identity and initializes it to 0 in all processors. The
example then compares the values of x and y in each processor; in processors where the
two are equal, the comparison instruction sets the test flag to 1. Loading the test flag
into the context flag then deactivates all processors where the two coordinates are not
equal. Finally, the example moves the value 1 into the identity field in all active
processors, which are those for which the two coordinate values are equal.

Example 15. Retrieving grid coordinates: identity-matrix.c.fragment


current_geom = CM_vp_set_geometry( CM_current_vp_set );
x_len = CM_geometry_coordinate_length( current_geom, 0 );
y_len = CM_geometry_coordinate_length( current_geom, 1 );

x = CM_allocate_stack_field( x_len );
y = CM_allocate_stack_field( y_len );
identity = CM_allocate_stack_field( 1 );

CM_my_news_coordinate_1L( x, 0, x_len );    /* 0 specifies axis */
CM_my_news_coordinate_1L( y, 1, y_len );    /* 1 specifies axis */

CM_u_move_zero_1L( identity, 1 );           /* initialize identity field */

CM_u_eq_2L( x, y, x_len, y_len );           /* set test flag if x = y */
CM_logand_context_with_test();              /* deactivate if x not = y */

CM_u_move_constant_1L( identity, 1, 1 );    /* set identity to 1 */
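
For readers without access to a CM, the effect of the sequence above can be imitated serially. The sketch below is plain C, not Paris; the function name identity_matrix and the row-major storage of the field are our own assumptions. It visits each virtual processor in turn and applies the same test-flag and context-flag steps:

```c
/* Serial emulation of the identity-matrix steps on an n-by-n grid.
   identity must point to n*n bytes; element [x * n + y] stands for the
   identity field of the virtual processor at coordinates (x, y). */
void identity_matrix(unsigned int n, unsigned char *identity)
{
    unsigned int x, y;

    for (x = 0; x < n; x++) {
        for (y = 0; y < n; y++) {
            unsigned char context = 1;        /* processor starts active */
            unsigned char test;

            identity[x * n + y] = 0;          /* initialize identity field */
            test = (unsigned char)(x == y);   /* set test flag if x = y */
            context = context & test;         /* logand context with test */
            if (context)
                identity[x * n + y] = 1;      /* only active processors store 1 */
        }
    }
}
```

On the CM the loop disappears: every virtual processor executes each step at once, and the context flag plays the role of the if statement.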


6.2 Nearest-Neighbor Communication 

Nearest-neighbor, or NEWS, communication is the simplest and one of the most
efficient means of interprocessor communication on the CM. Every virtual processor
accesses the memory of an immediate neighbor on the logical grid, all at the same time
and in the same direction.


Basic NEWS Instructions 

The basic NEWS communication instructions direct each active processor to send or 
get a message from the memory of its nearest neighbor in a specified direction. Since 
the concept of “neighbor” has meaning only on a grid axis, the instructions also take an 
axis operand: 

CM_get_from_news_1L dest source axis direction length
CM_send_to_news_1L dest source axis direction length

The direction operand is specified in C/Paris as either CM_upward or CM_downward,
both of type CM_communication_direction_t. The upward pattern indicates the
neighbor with the next-higher grid coordinate on the axis; the downward pattern
indicates the neighbor with the next-lower coordinate.

Notice that these instructions take only one length specifier: the dest and source field 
operands are taken to be of the same length. Also, the two fields must be either disjoint 
(no shared bits) or identical (all bits shared); they may not overlap partially. 


NEWS Accesses and Context 

If all processors are active, then CM_get_from_news_1L with direction CM_upward is
exactly equivalent to CM_send_to_news_1L with direction CM_downward. The
difference between the two instructions concerns context. The processor that performs
the action of the instruction (getting or sending) must be active; the neighbor processor
need not be active. (See Figure 21, which shows NEWS transfers between memory
fields in a 1-dimensional grid of processors.)

Figure 21. Effect of context on NEWS communication
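
The role of context can be seen in a small serial imitation of a 1-dimensional grid. The code below is plain C, not Paris; the names get_from_news, send_to_news, and the direction enumeration are hypothetical stand-ins, and the dest and source arrays are assumed to be disjoint. In the get form the active processor writes into its own dest; in the send form the active processor writes into its neighbor's dest:

```c
enum direction { UPWARD, DOWNWARD };

/* Index of the neighbor of processor p on a ring of n processors. */
static unsigned int neighbor(unsigned int p, unsigned int n, enum direction dir)
{
    return dir == UPWARD ? (p + 1) % n : (p + n - 1) % n;
}

/* Each ACTIVE processor fetches source from its neighbor into its own dest. */
void get_from_news(int *dest, const int *source, const unsigned char *context,
                   unsigned int n, enum direction dir)
{
    unsigned int p;

    for (p = 0; p < n; p++)
        if (context[p])
            dest[p] = source[neighbor(p, n, dir)];
}

/* Each ACTIVE processor stores its own source into its neighbor's dest. */
void send_to_news(int *dest, const int *source, const unsigned char *context,
                  unsigned int n, enum direction dir)
{
    unsigned int p;

    for (p = 0; p < n; p++)
        if (context[p])
            dest[neighbor(p, n, dir)] = source[p];
}
```

With every context flag set, a get in the upward direction and a send in the downward direction fill dest identically. Clearing the flag of a single processor makes the results differ: the get leaves that processor's own dest untouched, whereas the send leaves its downward neighbor's dest untouched.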

Paris also provides unconditional variants of the basic NEWS communication
instructions. The action of the two unconditional instructions is exactly equivalent if
the direction operands are opposite.

CM_get_from_news_always_1L dest source axis direction length
CM_send_to_news_always_1L dest source axis direction length


Example of NEWS Communication 

To illustrate data transfers between nearest neighbors, Example 16 shows a simple 
procedure whereby each processor averages its value in a source field with the values 
of its two nearest neighbors in the same field. 


Example 16. Accessing two nearest neighbors: neighbor-average.c


#include <cm/paris.h>

neighbor_average( source, axis, source_len )
CM_field_id_t source;
unsigned int axis, source_len;
{
    CM_field_id_t my_value, neighbor_up, neighbor_down;

    /* allocate temporary storage as subfields in each VP */
    my_value = CM_allocate_stack_field( source_len * 3 );
    neighbor_up =
        CM_add_offset_to_field_id( my_value, source_len );
    neighbor_down =
        CM_add_offset_to_field_id( my_value, source_len * 2 );

    /* initialize subfields */
    CM_u_move_1L( my_value, source, source_len );
    CM_u_move_zero_1L( neighbor_up, source_len * 2 );

    /* get value from source field in two neighbors */
    CM_get_from_news_1L( neighbor_up, source,
                         axis, CM_upward, source_len );
    CM_get_from_news_1L( neighbor_down, source,
                         axis, CM_downward, source_len );

    /* add the three values, divide by three, and truncate */
    CM_u_add_3_1L( my_value, neighbor_up, neighbor_down,
                   source_len );
    CM_u_truncate_constant_2_1L( my_value, 3, source_len );

    /* move result into source field and deallocate temporaries */
    CM_u_move_1L( source, my_value, source_len );
    CM_deallocate_stack_through( my_value );
}


Notice that this procedure does not need to identify any NEWS coordinates or even the 
current geometry (although it is an error if the axis argument exceeds the rank of the 
current geometry when the procedure is called). Because the only communication that 
occurs is between neighbors in a regular pattern, no processor needs to be identified 
by its address. 


Border Behavior 

The Paris NEWS instructions wrap when a processor is on the border of a grid. That is,
the processor with coordinate 0, when accessing downward, accesses the
highest-numbered processor on the axis, and the highest-numbered processor accesses
processor 0 when accessing upward. Thus, the grid is by default a toroidal mesh.
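
The wrapping rule amounts to modular arithmetic on the coordinate. A minimal sketch in plain C (the function names are ours and are not Paris calls):

```c
/* Coordinate of the upward neighbor on an axis of the given length,
   wrapping from the highest-numbered processor back to processor 0. */
unsigned int upward_neighbor(unsigned int coord, unsigned int axis_length)
{
    return (coord + 1) % axis_length;
}

/* Coordinate of the downward neighbor, wrapping from processor 0
   to the highest-numbered processor. */
unsigned int downward_neighbor(unsigned int coord, unsigned int axis_length)
{
    return (coord + axis_length - 1) % axis_length;
}
```

On an axis of length 128, for example, the upward neighbor of processor 127 is processor 0, and the downward neighbor of processor 0 is processor 127.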

The program can change the default border behavior by identifying the processors on
the border of the grid and deactivating them for the NEWS operation. It can then
activate only the border processors and specify some other operation. These actions
require the program to retrieve the NEWS coordinates of the processors on the axis of
interest and compare them with the result returned by CM_geometry_axis_length and
with 0.

For example, the following program (Example 17) performs the same NEWS
operations as the procedure shown above: it averages the values of each set of three
neighbors on a grid axis. However, the processors on the ends of the axis do not
participate in the operations. (This program differs from the example above in various
arbitrary ways; for instance, it manages storage somewhat differently, and it
initializes the source field with random numbers.)

The program in Example 17 uses the default (2-dimensional) vp-set, and all NEWS
operations occur on axis 0. The actions can be depicted as in the diagram above.


Example 17. Controlling grid border behavior: neighbor-average-no-wrap.c 


#include <stdio.h>
#include <cm/paris.h>

#define FIELD_LENGTH 16

main()
{
    CM_field_id_t my_value, neighbor_up, neighbor_down,
                  news_coordinate;
    unsigned int coord_length, end_coordinate;
    CM_geometry_id_t current_geometry;

    CM_init();

    /* identify the geometry of the current (default) vp-set */
    current_geometry = CM_vp_set_geometry( CM_current_vp_set );

    /* determine how many bits are needed for the coordinate on
       axis 0 of the current geometry */
    coord_length =
        CM_geometry_coordinate_length( current_geometry, 0 );

    /* allocate storage as subfields in each virtual processor */
    news_coordinate =
        CM_allocate_heap_field( coord_length + FIELD_LENGTH * 3 );
    my_value =
        CM_add_offset_to_field_id( news_coordinate, coord_length );
    neighbor_up =
        CM_add_offset_to_field_id( my_value, FIELD_LENGTH );
    neighbor_down =
        CM_add_offset_to_field_id( my_value, FIELD_LENGTH * 2 );

    /* set context and initialize fields */
    CM_set_context();
    CM_u_move_zero_1L( my_value, FIELD_LENGTH * 3 );
    CM_u_random_1L( my_value, FIELD_LENGTH,
                    1<<( FIELD_LENGTH - 2 ) );    /* the limit operand */
    CM_my_news_coordinate_1L( news_coordinate, 0, coord_length );


    /* ============================================================ */

    /* Deactivate the end processors on the first axis */
    end_coordinate =
        CM_geometry_axis_length( current_geometry, 0 ) - 1;
    CM_u_ne_constant_1L( news_coordinate, end_coordinate,
                         coord_length );
    CM_logand_context_with_test();
    CM_u_ne_constant_1L( news_coordinate, 0, coord_length );
    CM_logand_context_with_test();

    /* ============================================================ */

    /* get values from neighbors and average them */
    CM_get_from_news_1L( neighbor_up, my_value,
                         0, CM_upward, FIELD_LENGTH );
    CM_get_from_news_1L( neighbor_down, my_value,
                         0, CM_downward, FIELD_LENGTH );
    CM_u_add_3_1L( my_value, neighbor_up, neighbor_down,
                   FIELD_LENGTH );
    CM_u_truncate_constant_2_1L( my_value, 3, FIELD_LENGTH );

    /* ============================================================ */

    /* signal program completion from the front end */
    printf( "Program execution completed." );
}
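
The no-wrap behavior can be imitated on a single axis in ordinary C. The sketch below is not Paris code and the name neighbor_average_no_wrap is hypothetical. It assumes that the intended result is the truncated mean of each interior element and its two neighbors, that the two end elements (which correspond to the deactivated processors) keep their values, and that the sum of three values fits in an unsigned int:

```c
#include <stdlib.h>

/* Serial imitation of a three-point average on one axis of n elements,
   with the two end elements left untouched.
   Returns 0 on success, -1 if temporary storage cannot be allocated. */
int neighbor_average_no_wrap(unsigned int *value, unsigned int n)
{
    unsigned int *old;
    unsigned int p;

    if (n < 3)
        return 0;                       /* no interior elements to update */

    old = malloc(n * sizeof *old);      /* snapshot: all reads precede writes */
    if (old == NULL)
        return -1;
    for (p = 0; p < n; p++)
        old[p] = value[p];

    for (p = 1; p + 1 < n; p++)         /* interior elements only */
        value[p] = (old[p - 1] + old[p] + old[p + 1]) / 3;

    free(old);
    return 0;
}
```

The snapshot array mirrors the parallel semantics: on the CM every processor fetches its neighbors' values before any processor overwrites its own. Note that an interior element next to an end still reads the end element's value, since the neighbor in a NEWS get need not be active.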


6.3 Remote-Neighbor Communication 

Remote-neighbor communication refers to the parallel transfer of data in regular
patterns between processors that are not nearest neighbors on a grid axis.

This form of communication differs from general communication (covered in Chapter
7) in that each pair of communicating processors is in exactly the same spatial
relationship as all other pairs. Remote-neighbor communication thus shares the
performance optimizations of other grid communications (no address needed, no
collisions possible), as well as hardware support for what is in effect a series of
nearest-neighbor transfers.

The most straightforward method of communicating with remote neighbors is by
simply making repeated calls to a NEWS instruction, perhaps with changes to the axis
and direction operands. For example, in the Game of Life problem shown in Appendix A,
each processor needs to check a value in each of eight neighbors on a 2-dimensional
grid (the four NEWS neighbors and the four diagonal neighbors). The diagonal
neighbors cannot be accessed directly, since the underlying hardware does not support
them as nearest neighbors. However, two calls to CM_get_from_news_1L suffice to
access a diagonal neighbor:

CM_get_from_news_1L( neighbor_N, source, 1, CM_upward, LEN );
CM_get_from_news_1L( neighbor_NE, neighbor_N, 0,
                     CM_downward, LEN );

In the first line, every active processor gets a value from its upward neighbor on axis 1
and stores the value in its own neighbor_N field. In the second line, every active
processor gets the value that its downward neighbor on axis 0 has just acquired and
stores that value in its own neighbor_NE field. (See Appendix A for the complete
program.)

Figure 23. Accessing diagonal neighbors in two NEWS operations
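
The two-step access can be traced in a serial imitation of a wrapping 2-dimensional grid. The code is plain C, not Paris; the function name get_diagonal and the storage layout (element [x * h + y] for the processor at coordinate x on axis 0 and y on axis 1) are assumptions made for this sketch, and all processors are taken to be active:

```c
/* Serial imitation of the two NEWS gets on a w-by-h wrapping grid.
   neighbor_N and neighbor_NE must each hold w*h elements and must not
   overlap source or each other. */
void get_diagonal(int *neighbor_NE, int *neighbor_N, const int *source,
                  unsigned int w, unsigned int h)
{
    unsigned int x, y;

    /* step 1: get source from the upward neighbor on axis 1 */
    for (x = 0; x < w; x++)
        for (y = 0; y < h; y++)
            neighbor_N[x * h + y] = source[x * h + (y + 1) % h];

    /* step 2: get neighbor_N from the downward neighbor on axis 0 */
    for (x = 0; x < w; x++)
        for (y = 0; y < h; y++)
            neighbor_NE[x * h + y] = neighbor_N[((x + w - 1) % w) * h + y];
}
```

After both steps, the processor at (x, y) holds in neighbor_NE the source value of the processor at (x - 1, y + 1), with both coordinates wrapping at the grid borders.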


For accessing remote neighbors that lie a power-of-2 distance on the same grid axis, 
Paris provides instructions that perform the operation in one step: 

CM_get_from_power_two_1L dest source axis distance direction length

CM_get_from_power_two_always_1L dest source axis distance direction length

These instructions take the same operands as CM_get_from_news_1L, plus a distance
operand. The distance operand is the base-2 log of the number of grid positions to be 
traversed. 
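
Because the distance operand is a logarithm, a transfer across 8 grid positions is requested with a distance of 3. The following plain C sketch (the helper name power_two_distance_operand is hypothetical and not part of Paris) converts a number of grid positions into that operand and rejects distances that are not powers of 2:

```c
/* Hypothetical helper: returns the base-2 log of positions,
   or -1 if positions is zero or not a power of 2. */
int power_two_distance_operand(unsigned int positions)
{
    int log = 0;

    if (positions == 0 || (positions & (positions - 1)) != 0)
        return -1;

    while (positions > 1) {
        positions >>= 1;
        log++;
    }
    return log;
}
```

A distance of 1 grid position yields an operand of 0; a distance such as 6 has no single-instruction form and would presumably be built from two transfers (4 positions, then 2).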


6.4 Front-End Array Transfers

The most convenient and efficient means of transferring large amounts of data
between CM memory and the front end is the Paris instructions that read and write
NEWS arrays. These instructions transfer data between a virtual processor grid and a
front-end array of comparable size and shape. Their implementation is optimized for
comparatively high throughput.

Separate array-transfer instructions are implemented for signed and unsigned
integers and for floating-point numbers. All the variants are unconditional.

The instructions take a large number of operands that describe the front-end array 
and the NEWS grid, as well as the CM memory field to be accessed. The array or grid 
can be a subarray or a portion of a grid. See the Paris Reference Manual for a complete 
description of the operands to the array-transfer instructions. 

Briefly, the operands are: 


CM_type_read_from_news_array_1L

    front-end-array     front-end (dest) array
    offset-vector       offsets for dest array
    start-vector        lowest NEWS coords
    end-vector          highest NEWS coords
    axis-vector         axes of NEWS grid
    source              CM source (src) field
    source-len          length (bits) of src
    rank                rank of dest array
    dimension-vector    axes of dest array
    element-length      length (bytes) of dest array elements


CM_type_write_to_news_array_1L

    front-end-array     front-end (src) array
    offset-vector       offsets for src array
    start-vector        lowest NEWS coords
    end-vector          highest NEWS coords
    axis-vector         dims of NEWS grid
    dest                CM dest field
    dest-len            length (bits) of dest
    rank                rank of src array
    dimension-vector    dims of src array
    element-length      length (bytes) of src array elements

For example, the following procedure transfers data from a 2-dimensional grid on the
CM to a specified array on the front end. The procedure is simplified by having the
grid be square; that is, the parameter array_edge_size specifies the highest NEWS
coordinate on both dimensions of the grid. However, the procedure could easily be
extended to higher ranks and unequal axis lengths. Also, we would need to change only
the Paris call to have this procedure write from the front end to the CM.


Example 18. A block data transfer: read-news-array.c


#include <cm/paris.h>

#define RANK 2

get_square_array_from_cm( front_end_array, source_field,
                          source_length, array_edge_size )
unsigned int *front_end_array;
CM_field_id_t source_field;
unsigned int source_length, array_edge_size;
{
    /* offsets into front_end_array */
    int fe_offset_vector[RANK];          /* note signed integers */

    /* the "start" NEWS coordinate of the CM grid */
    unsigned cm_start_vector[RANK];

    /* the "end" NEWS coordinate of the CM grid */
    unsigned cm_end_vector[RANK];

    /* the NEWS axes to transfer */
    unsigned cm_axis_vector[RANK];

    /* the dimensions of the front-end array */
    unsigned fe_dim_vector[RANK];

    /* Initialize parameters for the array transfer */
    fe_offset_vector[0] = 0;
    fe_offset_vector[1] = 0;
    cm_start_vector[0]  = 0;
    cm_start_vector[1]  = 0;
    cm_end_vector[0]    = array_edge_size;
    cm_end_vector[1]    = array_edge_size;
    cm_axis_vector[0]   = 0;
    cm_axis_vector[1]   = 1;
    fe_dim_vector[0]    = array_edge_size;
    fe_dim_vector[1]    = array_edge_size;

    /* Perform the array transfer (unsigned-integer variant) */
    CM_u_read_from_news_array_1L
        ( front_end_array, fe_offset_vector,
          cm_start_vector, cm_end_vector, cm_axis_vector,
          source_field, source_length, RANK,
          fe_dim_vector, sizeof( unsigned int ) );
}


6.5 Cumulative Communications 

Paris provides a number of extremely powerful instructions that combine computation 
and communication along an axis of a logical grid. Such operations are called parallel 
prefix operations, on the analogy of prefix operations in array processing (where the 
operation applies some combiner over all prefixes of an array). 

For instance, it is frequently useful to compute partial sums of the values along a grid 
axis, where each processor computes the total of itself and all processors before it in a 
specified direction. The last processor computes the grand total. This seemingly serial 
operation is in fact one of the most efficient parallel operations on the CM system. 

This section introduces the most basic of the parallel prefix operations, CM_scan. The
extensions of this operation, CM_reduce, CM_spread, and CM_multispread, are
described in the Paris Reference Manual.


Scan Operations 

The CM_scan instructions take a binary associative operator @ and an ordered set of
elements [ a_0, a_1, a_2, ... ], and compute the ordered set


[ a_0, (a_0 @ a_1), (a_0 @ a_1 @ a_2), ... ]

The action is as intuitive as the procedure for balancing a checkbook. Consider
a set of checkbook transactions (credit and debit) stored one-per-processor along a
NEWS grid in chronological order. The following instruction computes a running
balance:

CM_scan_with_s_add_1L dest source axis length direction inclusion
                      smode sbit

Ignoring the last two operands for the moment, we see that most of the others are
familiar from the grid communications discussed previously. The inclusion operand
determines whether a processor’s initial value is included in the computation; it can be
either CM_inclusive or CM_exclusive.

If the check transactions are stored in field transactions and the running balance is to 
be placed in field year_balance, the call looks like: 

CM_scan_with_s_add_1L( year_balance, transactions, /* axis= */ 0,
                       LEN, CM_upward, CM_inclusive,
                       CM_none, CM_no_field );

The last two operands are discussed in the next section. 
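
To make the inclusion operand concrete, here is a serial imitation of an unsegmented add scan in the upward direction. It is plain C rather than Paris, the names scan_with_add and the inclusion enumeration are our own, all processors are taken to be active, and we assume that in the exclusive case the first processor receives 0 (the identity for addition):

```c
enum inclusion { INCLUSIVE, EXCLUSIVE };

/* Running sum of source[0..n-1] written to dest[0..n-1].
   INCLUSIVE: dest[p] includes source[p] itself.
   EXCLUSIVE: dest[p] covers only the elements before p. */
void scan_with_add(long *dest, const long *source, unsigned int n,
                   enum inclusion inc)
{
    long running = 0;
    unsigned int p;

    for (p = 0; p < n; p++) {
        if (inc == INCLUSIVE) {
            running += source[p];
            dest[p] = running;
        } else {
            dest[p] = running;
            running += source[p];
        }
    }
}
```

For the transactions 100, -20, -30, 50, the inclusive form produces the running balance 100, 80, 50, 100, and the exclusive form produces the balance before each transaction: 0, 100, 80, 50. The serial loop takes n steps; the point of the CM instruction is that the same result is obtained without a step-by-step traversal.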

The CM_scan operations are provided with the arithmetic combiners add, min, and
max, as well as with logand, logior, logxor, and copy. The arithmetic combiners are
provided in s, u, and f variants for signed and unsigned integers and floating-point
numbers. Some examples:

CM_scan_with_f_max_1L dest source axis s-len e-len direction inclusion
                      smode sbit

CM_scan_with_copy_1L dest source axis length direction inclusion
                     smode sbit

In a scan_with_type_max operation, for instance, each processor receives the largest of 
the values from the processors that precede it on the axis in the specified direction. In a 
scan_with_copy operation, each processor receives the value from the first processor 
on the axis in the specified direction. 

All the CM_scan instructions are conditional. Inactive processors are treated as if they 
did not exist: their dest field values are not changed, and their source field values are 
not considered in the running computation. 
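
The treatment of inactive processors can be illustrated with a serial imitation of an inclusive max scan. The sketch is plain C, not Paris, and the name scan_with_max is hypothetical; a context array stands in for the context flag:

```c
/* Inclusive running maximum over the ACTIVE elements only.
   Inactive elements are skipped: their dest is left unchanged and
   their source does not contribute to the running value. */
void scan_with_max(int *dest, const int *source,
                   const unsigned char *context, unsigned int n)
{
    int running = 0;
    int have_value = 0;       /* becomes 1 after the first active element */
    unsigned int p;

    for (p = 0; p < n; p++) {
        if (!context[p])
            continue;
        if (!have_value || source[p] > running) {
            running = source[p];
            have_value = 1;
        }
        dest[p] = running;
    }
}
```

With source values 3, 9, 2, 7, 1 and the second processor inactive, the active processors receive 3, 3, 7, 7; the value 9 is ignored and the second processor's dest keeps whatever it held before.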


Segmented Scan Operations

The last two operands of the CM_scan instructions support segmented scans, where the 
running computation “restarts” at specified points on the grid axis. 

For example, a simple scan_with_add of a year’s checkbook transactions yields partial 
totals (subtotals) for each of the transactions, ending with the net total for the year. To 
find the net total for each month, however, we need to divide the grid axis into monthly 
segments and restart the scan operation with each month. 

For this purpose, we use a one-bit field where the value 1 indicates the beginning of a
new segment, or scan set. Assume for the moment that the field
month_segment_marker is set in this way. That is, assume that each processor whose
transactions value is the first transaction in a month has the value 1 in the field
month_segment_marker and all other processors have the value 0 in that field.

The call that computes net monthly cash balance is: 

CM_scan_with_s_add_1L( month_balance, transactions, /* axis= */ 0,
                       LEN, CM_upward, CM_inclusive,
                       CM_start_bit, month_segment_marker );

This call restarts the running total at the beginning of each month. The month_balance 
value associated with the last transaction of each month is the net balance for that 
month only, not for the year to date. 

The smode operand CM_start_bit indicates that the instruction is a segmented scan. 
The alternative is CM_none, which was used in the non-segmented scan operation 
shown above. The last operand, sbit, is the field-id of the segment field, in this case 
month_segment_marker. If smode is CM_none, then sbit can be the dummy field-id 
CM_no_field. 
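
A serial imitation shows how the segment bit restarts the running total. As before, this is plain C and not Paris; the name segmented_scan_with_add is our own, the scan is inclusive and upward, and all processors are assumed active:

```c
/* Inclusive running sum that restarts wherever start_bit is 1. */
void segmented_scan_with_add(long *dest, const long *source,
                             const unsigned char *start_bit, unsigned int n)
{
    long running = 0;
    unsigned int p;

    for (p = 0; p < n; p++) {
        if (start_bit[p])
            running = 0;          /* a new segment begins here */
        running += source[p];
        dest[p] = running;
    }
}
```

For the transactions 10, -5, 20, 7, 3, -4 with segment bits 1, 0, 0, 1, 0, 1, the result is 10, 5, 25, 7, 10, -4: the last element of each segment (25, 10, and -4) holds the total for that segment alone.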

How do we compute the segment bit? Assume that each transaction is associated with
a date and thus has a field my_month. Since the transactions are sorted
chronologically along a grid axis, we can use NEWS instructions to find the spots where
adjoining processors have different my_month values. Each processor gets its downward
neighbor’s month and compares it with its own month. If the two values are not equal,
the processor sets its own month_segment_marker to 1.

The procedure is: 

CM_get_from_news_1L( neighbor_month, my_month, 0,
                     CM_downward, LEN );

/* clear the segment bit */
CM_clear_bit_always( month_segment_marker );

/* compare month with downward neighbor's month in each proc */
CM_u_ne_1L( my_month, neighbor_month, LEN );

/* if the two months are not equal, set the segment bit to 1 */
CM_store_test( month_segment_marker );
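
The same comparison can be written serially for experimentation. The sketch below is plain C, not Paris, and the name mark_month_segments is hypothetical; the downward neighbor of processor 0 is taken to be the last processor, following the default wrapping behavior of the NEWS instructions:

```c
/* Sets marker[p] to 1 where month[p] differs from the month of the
   downward neighbor (with wraparound), and to 0 elsewhere. */
void mark_month_segments(unsigned char *marker, const unsigned int *month,
                         unsigned int n)
{
    unsigned int p;

    for (p = 0; p < n; p++) {
        unsigned int neighbor_month = month[(p + n - 1) % n];
        marker[p] = (unsigned char)(month[p] != neighbor_month);
    }
}
```

For months 1, 1, 2, 2, 2, 3 the markers are 1, 0, 1, 0, 0, 1. In this imitation the first processor is marked only because wrapping makes it compare its month with that of the last processor; if every transaction fell in the same month, no marker would be set.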


Other examples that use the CM_scan instructions can be found in Appendix F, “Lines 
of Sight,” and Appendix G, “Drawing Lines.” 



Chapter 7 

Communicating in Arbitrary Patterns 


Perhaps the most distinctive feature of the CM system is general communication, 
where each virtual processor transfers a message to any other processor by specifying 
the destination processor’s address. 

As with grid communication, every active processor communicates with some other 
processor, all at the same time. Unlike grid communication, the pairs of communicating
processors need not be in any particular spatial relationship to each other. In fact,
the source and destination processors can be in different vp-sets. 

General communication is sometimes called router communication, a reference to the 
underlying packet-switching mechanism by which each message is routed along one of 
the many possible paths to its destination. The path chosen may vary according to the 
distance to be traversed and the amount of message “traffic” on a particular wire. 

This chapter presents the basic information needed to perform general communication
in a C/Paris program:

• Computing processor addresses 

• Using the general communication instructions 

• Transferring data between designated CM processors and the front end 


7.1 Processor Addresses 

Every processor within a vp-set is uniquely identified by an unsigned integer called its 
send address. Like NEWS coordinates, send addresses are unique only within a vp-set; 
it is possible for virtual processors in different vp-sets to have the same send address. 
Unlike NEWS coordinates, however, a processor’s send address remains constant for 
the life of its vp-set, regardless of changes in the vp-set’s geometry. 

In the present version of Paris, the send addresses in each vp-set are consecutive
unsigned integers beginning with 0 and extending to the total size of the vp-set minus 1.
However, this feature is an artifact of the present restriction of total vp-set size to
powers of 2. In future versions, the send addresses for a vp-set may not be consecutive.


NOTE 

Paris programs should not assume that send addresses occupy 
a contiguous range. In particular, we discourage arithmetic on 
send addresses. For a contiguous ordering of all processors, 
please use a 1-dimensional NEWS grid. 


Computing Self-Addresses

Within the current vp-set, each active processor can compute its own send address and 
store that value in a destination field in the same processor: 

CM_my_send_address dest

The field size needed to store the send address can be determined from the vp-set’s 
geometry: 

CM_geometry_send_address_length geometry-id

For example:

unsigned int dest_len; 

CM_field_id_t dest_field; 

CM_geometry_id_t current_geom; 

current_geom = CM_vp_set_geometry( CM_current_vp_set ); 
dest_len = CM_geometry_send_address_length( current_geom ); 

dest_field = CM_allocate_stack_field( dest_len ); 

CM_my_send_address( dest_field ); 

The send address reflects only the size, not the shape, of a vp-set. If the vp-set is later
associated with a different geometry, the send address of each processor remains the 
same because the second geometry must be of the same total size as the first (see the 
discussion of changing geometries in Chapter 5). 

The send address of a virtual processor is composed of two parts, the physical part and 
the virtual part. The physical part indicates which physical processor supports the VP; 
the virtual part indicates a particular VP on that physical processor. The address is 
thus a reflection of the mapping of virtual to physical processors, which does not vary 
during program execution. The address may change between program runs, however, 
since the physical part varies with physical machine size and the virtual part varies 
with the vp-set’s virtual processor ratio (also a function of physical machine size). 


Computing Send Addresses on the Front End 

Any processor, including the front end, can compute the send address of any arbitrary 
CM processor from that processor’s NEWS coordinates in a geometry. Aside from 
CM_my_send_address, the instructions that convert NEWS coordinates to send
addresses are the only supported way to obtain a send address in Paris.

Paris provides two instructions that compute, entirely on the front end, the send
address of any single CM processor:

CM_sendaddr_t
CM_fe_make_news_coordinate geometry axis news-coord

CM_sendaddr_t
CM_fe_deposit_news_coordinate geometry send-address axis news-coord

The first instruction takes a geometry-id, an axis within that geometry, and the
unsigned integer coordinate of a processor on that axis. It converts the NEWS coordinate
into a send address on the assumption that all coordinates other than the one specified
are 0. If geometry is in fact 1-dimensional, the send address returned is complete. When
a send address is stored on the front end, it is of type CM_sendaddr_t.

If the geometry is multidimensional, we need to build on the partial send address
returned by CM_fe_make_news_coordinate by adding information about the other
coordinates. The procedure is to pass that partial address as an argument to
CM_fe_deposit_news_coordinate, along with the processor’s NEWS coordinate on another axis
of the geometry. By successive calls to CM_fe_deposit_news_coordinate, the front end
can build the complete send address of a CM processor from all its NEWS coordinates.





For example, suppose we want to compute the send address of the processor whose
NEWS coordinates on axes 0, 1, and 2 of geometry_3D are 10, 50, and 200:

CM_sendaddr_t send_addr;
unsigned int axis;
CM_geometry_id_t geometry_3D;

send_addr = CM_fe_make_news_coordinate( geometry_3D, 0, 10 );
send_addr = CM_fe_deposit_news_coordinate( geometry_3D,
                                           send_addr, 1, 50 );
send_addr = CM_fe_deposit_news_coordinate( geometry_3D,
                                           send_addr, 2, 200 );
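
The make-then-deposit pattern can be illustrated with a toy model in plain C. This sketch is not the Paris implementation and does not reproduce the real address layout, which this manual does not specify. It assumes, purely for illustration, that each axis occupies its own contiguous bit field in the address; struct toy_geometry, toy_make, and toy_deposit are hypothetical names.

```c
#include <assert.h>

#define TOY_MAX_AXES 4

/* Hypothetical geometry descriptor: bits[a] is the coordinate length of
   axis a. This is not a Paris data structure. */
struct toy_geometry {
    unsigned rank;
    unsigned bits[TOY_MAX_AXES];
};

/* Bit offset of an axis within the toy address (axis 0 is lowest). */
static unsigned toy_offset(const struct toy_geometry *g, unsigned axis)
{
    unsigned a, off = 0;

    assert(axis < g->rank);
    for (a = 0; a < axis; a++)
        off += g->bits[a];
    return off;
}

/* Replace the coordinate of one axis in an existing address. */
unsigned long toy_deposit(const struct toy_geometry *g, unsigned long addr,
                          unsigned axis, unsigned long coord)
{
    unsigned off = toy_offset(g, axis);
    unsigned long mask = ((1UL << g->bits[axis]) - 1UL) << off;

    assert(coord < (1UL << g->bits[axis]));
    return (addr & ~mask) | (coord << off);
}

/* Address with the given coordinate on one axis and 0 on all others. */
unsigned long toy_make(const struct toy_geometry *g,
                       unsigned axis, unsigned long coord)
{
    return toy_deposit(g, 0UL, axis, coord);
}
```

As in the front-end calls above, the address is built one axis at a time: toy_make supplies the first coordinate, and each toy_deposit call fills in another axis without disturbing the ones already present.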


Computing Send Addresses on the CM 

Analogous instructions direct each active CM processor to construct a send address 
from a set of NEWS coordinates. The parallel instructions are similar to the front-end 
instructions except that the news-coord operand and the “result” are fields. Also, these 
instructions take a length specifier for the field news-coord: 

CM_make_news_coordinate_1L geometry dest axis news-coord coord-len
CM_deposit_news_coordinate_1L geometry dest axis news-coord coord-len

For example, suppose we want each processor in geometry_3D to compute its own 
send address from its NEWS coordinates. (This trivial exercise is an inefficient way to 
simulate CM_my_send_address, but it suffices to illustrate the generic procedure. 
More-useful examples are shown later in this section.) The procedure is: 

1. Compute the NEWS coordinates for all the processors. 

2. Create a field in the current vp-set to store the send addresses. 

3. Call CM_make_news_coordinate_1L on the coordinates of one axis.

4. Call CM_deposit_news_coordinate_1L successively on the coordinates of each
remaining axis.


Step 1 NEWS Coordinates 

Compute the processors’ NEWS coordinates (as explained in Chapter 6) and 
place them in, say, fields x, y, and z. 

x_len = CM_geometry_coordinate_length( geometry_3D, 0 );
y_len = CM_geometry_coordinate_length( geometry_3D, 1 );
z_len = CM_geometry_coordinate_length( geometry_3D, 2 );

x = CM_allocate_stack_field( x_len );
y = CM_allocate_stack_field( y_len );
z = CM_allocate_stack_field( z_len );

CM_my_news_coordinate_1L( x, 0, x_len );
CM_my_news_coordinate_1L( y, 1, y_len );
CM_my_news_coordinate_1L( z, 2, z_len );


Step 2 Destination Field 

Next, create a field long enough to contain the send address. This field will be 
the destination field for all the calls to the address-building instructions. 

dest_len = CM_geometry_send_address_length( geometry_3D ); 
dest = CM_allocate_stack_field( dest_len ); 


Step 3 First NEWS Coordinate 

Then, beginning with any axis of the geometry, call CM_make_news_coordinate_1L,
specifying the coordinate field on that axis and the length of the coordinate
field:

CM_make_news_coordinate_1L( geometry_3D, dest, 0, x, x_len );

This instruction converts a NEWS coordinate into a send address on the
assumption that all coordinates other than the one specified are 0. If geometry
were in fact 1-dimensional, the send address would now be complete.


Step 4 Other NEWS Coordinates 

Since the geometry is multidimensional, build the send address with successive
calls to CM_deposit_news_coordinate_1L, once for each remaining axis of
the geometry. Specify the destination field used in Step 3 in all these calls.

CM_deposit_news_coordinate_1L( geometry_3D, dest, 1, y, y_len );
CM_deposit_news_coordinate_1L( geometry_3D, dest, 2, z, z_len );


Converting Send Addresses to NEWS Coordinates

Paris also provides instructions that extract NEWS coordinates back out from send 
addresses, either on the front end or in each active CM processor: 

unsigned int
CM_fe_extract_news_coordinate geometry axis send-address

CM_extract_news_coordinate_1L geometry dest axis send-address dest-len

The front-end instruction returns an unsigned integer that is the NEWS coordinate of 
the specified processor along the specified geometry axis. The send-address operand of 
the front-end instruction is of type CM_sendaddr_t. 

The CM instruction directs each active processor to perform the parallel analogue of 
that action: derive the appropriate NEWS coordinate of the processor specified in its 
send-address field and place the result in the dest field (which is of length dest-len). If 
geometry is multidimensional, we can compute the full NEWS address by making successive
calls to CM_extract_news_coordinate_1L and storing the results in separate
fields or subfields. 
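
Extraction is the inverse of building an address one axis at a time. The sketch below shows the idea in plain C under a toy layout in which each axis occupies its own contiguous bit field, axis 0 lowest. That layout is an assumption for illustration only and is not the actual Paris mapping; toy_extract is a hypothetical name.

```c
#include <assert.h>

/* Toy layout (an assumption, not the Paris layout): axis a occupies
   bits[a] bits of the address, with axis 0 in the lowest bits.
   Returns the coordinate of the addressed processor along one axis. */
unsigned long toy_extract(const unsigned *bits, unsigned rank,
                          unsigned axis, unsigned long addr)
{
    unsigned a, off = 0;

    assert(bits != 0 && axis < rank);
    for (a = 0; a < axis; a++)
        off += bits[a];         /* skip the fields of the lower axes */
    return (addr >> off) & ((1UL << bits[axis]) - 1UL);
}
```

Calling the function once per axis recovers the full set of coordinates, which mirrors the successive calls described above.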


Example of Address Conversions 

The instructions that convert between send addresses and NEWS coordinates are particularly
useful when a vp-set’s geometry changes. As mentioned in Chapter 5, the grid
ordering of the processors in the new geometry is not what one might expect from serial 
computers. In fact, it is not possible to predict the new NEWS coordinates of particular 
processors from their coordinates in the previous geometry. 

The program can determine which processors are which by having them compute their 
send addresses before changing the geometry. Afterward, each processor can derive 
its new NEWS coordinates from the send address, which has remained unchanged. For 
example, consider a change in geometry from 2-dimensional to 1-dimensional: 


Example 19. Addresses and changing geometries: send-addr-to-news.c.fragment 


CM_geometry_id_t geom_2D, geom_1D;
CM_vp_set_id_t my_vp_set;
CM_field_id_t send_addr, news_x;
unsigned int send_addr_len, x_len;

/* create geometries, etc. */

my_vp_set = CM_allocate_vp_set( geom_2D );
CM_set_vp_set( my_vp_set );

/* various operations on the 2-D grid */

/* compute self-addresses */
send_addr_len = CM_geometry_send_address_length( geom_2D );
send_addr = CM_allocate_heap_field( send_addr_len );
CM_my_send_address( send_addr );

/* change geometry to 1-D */
CM_set_vp_set_geometry( my_vp_set, geom_1D );

/* prepare a field for the new grid coordinates */
x_len = CM_geometry_coordinate_length( geom_1D, /*axis =*/ 0 );
news_x = CM_allocate_heap_field( x_len );

/* Extract new grid coordinates from send addresses */
CM_extract_news_coordinate_1L( geom_1D, news_x, 0,
                               send_addr, x_len );

/* various operations on the 1-D grid */ 


7.2 The Basic send Instruction 

General interprocessor communications are performed by the CM_send family of instructions.
These instructions move the contents of a source field in each active processor
to a destination field in any specified processor. (The opposite operation, CM_get,
is described briefly in Section 7.4 below.) 

The simplest of the send instructions is: 

CM_send_1L dest send-address source len notify 






This instruction, like all the variants of CM_send, takes field-ids for the message
(source), the send address of the destination processor, and the field (dest) in the
destination processor where the message is to be deposited.

The send instructions also take a length specifier that applies to both source and dest.
The message can be any length that is legal for its CM data format, as detailed in
Chapter 3. In addition, all the send instructions take a one-bit field, notify, that is set in the
destination processor when the message arrives.

The action of CM_send_1L is illustrated in Figure 24 for an arbitrary set of five virtual
processors. (In this example, all the processors send to different processors; Section 
7.3 discusses cases where multiple processors send to the same processor.) 


Figure 24. Change in CM state from executing CM_send_1L


Notice that the destination field need not be cleared before a call to CM_send_1L. In
Figure 24, processor 32 starts with the value 100 in its dest field, but it is overwritten by
the message from processor 34. It is wise, though, to clear the notify field, since a
preexisting 1 in a processor that receives no message would defeat the purpose of
notification.

Notice also that the four fields are disjoint in this example, which is generally
advisable. Some field overlap is permitted, however, as detailed in the descriptions of the
individual CM_send instructions in the Paris Reference Manual.

Effect of Context 

All the CM_send instructions are conditional: only active processors can send a
message. A processor does not need to be active to receive a message. (See Figure 25.)



Figure 25. Effect of context on CM_send_1L


The notify bit is particularly useful when the program has manipulated context and 
needs to check which messages were sent. There is, of course, a performance penalty 
associated with notification. If no notification is desired, the program can supply the 
dummy field-id CM_no_field as the notify operand to the send instructions. 
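
The combined effect of context and notification can be modeled serially. The following is a sketch in plain C, not Paris code. It assumes that plain array indices stand in for send addresses, that no two active senders name the same destination, and that source and dest are separate arrays; model_send is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of a send without collisions. Every processor i whose
   context[i] is nonzero delivers source[i] to dest[address[i]] and sets
   notify[address[i]]. Inactive processors send nothing but can still
   receive. */
void model_send(int *dest, unsigned char *notify, const size_t *address,
                const int *source, const unsigned char *context, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!context[i])
            continue;               /* inactive: does not send */
        assert(address[i] < n);
        dest[address[i]] = source[i];
        if (notify != NULL)         /* NULL plays the role of CM_no_field */
            notify[address[i]] = 1;
    }
}
```

A destination whose sender was inactive keeps both its old dest value and its old notify bit, which is why the text recommends clearing notify beforehand.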


Example of a Simple Send

A simple example of general communication on the CM is a program that transposes a 
square array (flips it over its diagonal). What follows are the essential actions of the 
program that appears in Appendix C, “Transposing an Array.” 

Assume that the program has created a vp-set matrix_vp_set with a 2-dimensional
geometry matrix_geometry. The objective is for each processor to transfer the value in its
own field matrix_value to the same field in the processor that is its “opposite number” 
in the array. 

Determining the NEWS coordinates of the destination processors is straightforward, 
since each processor needs only to reverse its own NEWS coordinates in building the 
send address (that is, processor x,y sends to processor y,x). 

/* compute news coordinates */
CM_my_news_coordinate_1L( x, 0, news_coord_length );
CM_my_news_coordinate_1L( y, 1, news_coord_length );

/* build send address by reversing coordinate values */
CM_make_news_coordinate_1L( matrix_geometry, send_addr,
                            0, y, news_coord_length );
CM_deposit_news_coordinate_1L( matrix_geometry, send_addr,
                               1, x, news_coord_length );


The processors right on the matrix diagonal, where x = y, in effect compute their own 
send addresses. Therefore, the program should send to a temporary field rather than 
to the source field, since a processor cannot send to itself if the source and destination 
fields overlap. (See the discussion of field overlap under each of the send instructions
in the Paris Reference Manual.)

Assuming that the fields temp and notify have been allocated, the call that transfers
the value in field matrix_value from the source processors to the destination
processors is:

CM_send_1L( temp, send_addr, matrix_value, value_length, notify );

The program then moves the value from the temp field into the original source field:

CM_u_move_1L( matrix_value, temp, value_length );
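
For comparison, the same two-step transpose can be written serially. This is a sketch in plain C with a fixed illustrative size, not Paris code; TOY_N and model_transpose are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_N 3   /* side of the square array (illustrative) */

/* Serial model of the transpose: the element at (x,y) is delivered to
   the temporary at (y,x), then the temporary is copied back over the
   original values. */
void model_transpose(int value[TOY_N][TOY_N])
{
    int temp[TOY_N][TOY_N];
    size_t x, y;

    assert(value != NULL);
    for (x = 0; x < TOY_N; x++)
        for (y = 0; y < TOY_N; y++)
            temp[y][x] = value[x][y];   /* "send" to the opposite number */

    for (x = 0; x < TOY_N; x++)
        for (y = 0; y < TOY_N; y++)
            value[x][y] = temp[x][y];   /* move temp back to the source */
}
```

The temporary array plays the part of the temp field: values are never written into the array that is still being read.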


Example of a Simple Send across Vp-Sets

The procedure for sending messages across vp-sets is similar to sending within a vp-set.
The source, or “sending,” processors must be in the current vp-set. The destination
processors can be in any vp-set. Although send addresses are unique only within a 
vp-set, the system can determine from the dest operand which vp-set is meant, since a 
field is associated with exactly one vp-set. 

The only significant difference in the send procedure when working with multiple vp-sets
is in determining the length of the send-address field. The send addresses are computed
in the current vp-set, but their minimum length is determined by the geometry of
the destination vp-set. The notify field, if any, should be in the destination vp-set. 

For example, consider the simple point-drawing procedure shown in Example 20. This 
procedure sends color values from one vp-set to specified processors in another vp-set. 
The destination image is a field in a 2-dimensional vp-set where each processor represents
a pixel in a graphic image. The operation is performed in the vp-set where the
color values are stored. (The change in CM state from this procedure is shown afterward
in Figure 26.)


Example 20. Sending across vp-sets: draw-points.c 


#include <cm/paris.h>

draw_points( image, x, y, color, coord_length, color_length )
CM_field_id_t image, x, y, color;
unsigned int  coord_length, color_length;
{
    CM_field_id_t    send_addr;
    CM_geometry_id_t geometry;
    unsigned int     send_addr_length;

    /* determine length of send address in image field's vp-set */
    geometry = CM_vp_set_geometry( CM_field_vp_set( image ));
    send_addr_length =
        CM_geometry_send_address_length( geometry );

    /* allocate and initialize field for send address */
    send_addr = CM_allocate_stack_field( send_addr_length );
    CM_u_move_zero_1L( send_addr, send_addr_length );

    /* build send address from one news axis coordinate at a time */
    CM_make_news_coordinate_1L( geometry, send_addr, 0, x,
                                coord_length );
    CM_deposit_news_coordinate_1L( geometry, send_addr, 1, y,
                                   coord_length );

    /* send color values into the image field (no notification) */
    CM_send_1L( image, send_addr, color, color_length,
                CM_no_field );

    /* deallocate stack field */
    CM_deallocate_stack_through( send_addr );
}



Figure 26. Change in CM state from executing draw-points.c 


7.3 Handling Message Collisions

The examples of send instructions just shown are special cases in that no processor
received messages from more than one other processor; that is, there were no collisions.
Real programs frequently need to select among or combine multiple messages
that arrive at the same processor.

The variants on CM_send_1L are all means of specifying how to handle collisions. The
combiners provided are overwrite, add, max, min, logand, logior, and logxor. The arithmetic
combiners are each provided in s, u, and f variants for signed and unsigned integers
and floating-point numbers. A few examples:

CM_send_with_overwrite_1L dest send-address source len notify
CM_send_with_s_add_1L dest send-address source len notify
CM_send_with_f_max_1L dest send-address source sig-len exp-len notify

Unlike CM_send_1L, the CM_send_with_combiner_1L instructions do include the
original contents of the destination field when selecting among or combining
messages. For instance, CM_send_with_overwrite_1L chooses any one of the messages or
the original value (the one chosen is unpredictable). The instruction
CM_send_with_s_add_1L adds the original value and all incoming messages. To exclude the original
value from the operation, the program should prepare the destination field in the way
specified for the particular instruction in the Paris Reference Manual.
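
The role of the original destination value can be seen in a serial model. This is a sketch in plain C, not Paris code. It assumes that all senders are active and that array indices stand in for send addresses; combiner_fn, combine_add, combine_max, and model_send_with_combiner are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* A combiner folds one incoming message into the current value. */
typedef int (*combiner_fn)(int current, int message);

int combine_add(int current, int message)
{
    return current + message;
}

int combine_max(int current, int message)
{
    return message > current ? message : current;
}

/* Serial model of a send with a combiner: each message is folded into
   its destination, starting from whatever the destination already held. */
void model_send_with_combiner(int *dest, size_t n_dest,
                              const size_t *address, const int *source,
                              size_t n_src, combiner_fn combine)
{
    size_t i;

    assert(combine != NULL);
    for (i = 0; i < n_src; i++) {
        assert(address[i] < n_dest);
        dest[address[i]] = combine(dest[address[i]], source[i]);
    }
}
```

With the add combiner, a destination that starts at 5 and receives 1, 2, and 4 ends at 12. With the max combiner, a destination that starts at 7 and receives 3 keeps 7, because the original value takes part in the comparison.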

An example of handling collisions in general communication is seen in computing a 
histogram. This program, shown in its entirety in Appendix D, sends a 1 for each instance
of a particular value in a source field to the appropriate “bin” or accumulator
cell in another vp-set. Since the purpose of the exercise is to determine how many instances
there are of each value, the send instruction used is CM_send_with_u_add_1L.
Example 21 shows the accumulator procedure from this program. 


Example 21. Sending with a combiner: accumulate-votes.c 


#include <cm/paris.h>

accumulate_votes( accumulator_field, src_field,
                  accumulator_length, src_length )
CM_field_id_t accumulator_field, src_field;
unsigned int  accumulator_length, src_length;
{
    CM_geometry_id_t dest_geometry;
    CM_field_id_t    send_address, vote;
    unsigned int     send_address_length;

    /* Get information about actual arguments. */
    dest_geometry =
        CM_vp_set_geometry( CM_field_vp_set( accumulator_field ));
    send_address_length =
        CM_geometry_send_address_length( dest_geometry );

    /* Allocate temporary storage. */
    CM_set_vp_set( CM_field_vp_set( src_field ));
    vote = CM_allocate_stack_field( accumulator_length +
                                    send_address_length );
    send_address =
        CM_add_offset_to_field_id( vote, accumulator_length );

    /* Initialize temporary fields; setting vote to 1 with the full
       field length specified sets subfield send_address to 0. */
    CM_u_move_constant_1L( vote, 1, accumulator_length +
                                    send_address_length );

    /* To construct send addresses for the accumulator vp-set, use
       the src_field value as the destination processor's news
       coord; then send a "vote" from each source processor. */
    CM_make_news_coordinate_1L( dest_geometry,
                                send_address,
                                /* axis = */ 0,
                                /* coord = */ src_field,
                                src_length );

    CM_send_with_u_add_1L( accumulator_field,
                           send_address,
                           vote,
                           accumulator_length,
                           CM_no_field );

    CM_deallocate_stack_through( vote );
}
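
A serial equivalent of the accumulator makes the intended result easy to verify on small inputs. This is a sketch in plain C, not Paris code; model_accumulate_votes is a hypothetical name, and the bins are assumed to have been cleared by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of the accumulator: each source value names a bin, and
   each source element contributes a "vote" of 1 to that bin. */
void model_accumulate_votes(unsigned *bins, size_t n_bins,
                            const unsigned *src, size_t n_src)
{
    size_t i;

    for (i = 0; i < n_src; i++) {
        assert(src[i] < n_bins);   /* value must name an existing bin */
        bins[src[i]] += 1u;        /* colliding votes are summed */
    }
}
```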


7.4 A Word on CM_get_1L

The opposite instruction to CM_send_1L is

CM_get_1L dest send-address source len

This instruction directs each active processor in the current vp-set to get a message
from a specified field (source) in a specified virtual processor (send-address). The
source processors need not be active, and they can be in any vp-set. The length specifier 
describes both the source and dest operands; it can be any unsigned integer that is a 
legal length for the CM data format of the message. 

Multiple processors can read (get) from a single processor without contention; thus, 
CM_get_1L has no variants analogous to the combiner variants on CM_send_1L.

Be aware that CM_get_1L uses comparatively large amounts of temporary storage,
since it computes and stores information about the path each message is to traverse
between the source and destination processors. This instruction should be used
judiciously in programs where CM memory is at a premium. Also, because of the need to
store the path and then perform the data transfer, a get operation takes about twice 
the time of a comparable send operation. 
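
The data movement of a get can be modeled serially as a gather. This is a sketch in plain C, not Paris code, and it says nothing about the storage or timing costs just described. Array indices stand in for send addresses, and model_get is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of a get: each active processor i copies
   source[address[i]] into dest[i]. Several processors may name the
   same source processor; reads do not collide. */
void model_get(int *dest, const size_t *address, const int *source,
               size_t n_src, const unsigned char *context, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!context[i])
            continue;               /* inactive processors fetch nothing */
        assert(address[i] < n_src);
        dest[i] = source[address[i]];
    }
}
```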


7.5 Front-End Communications 

The front-end computer can use a send address to read from or write to any single CM 
processor. The instructions that transfer data between the CM and the front end are 
provided in three variants for signed and unsigned integers and floating-point
numbers, respectively. These instructions are all unconditional.

CM_s_write_to_processor_1L send-address dest source-value len
CM_u_write_to_processor_1L send-address dest source-value len
CM_f_write_to_processor_1L send-address dest source-value s-len e-len

int CM_s_read_from_processor_1L send-address source len
unsigned CM_u_read_from_processor_1L send-address source len
float CM_f_read_from_processor_1L send-address source s-len e-len

For all these instructions, the send-address operand is a front-end variable of type 
CM_sendaddr_t, computed according to the procedures shown above in Section 7.1. 
The source or dest operand is a CM field (CM_field_id_t) of length len. The
source-value (for write) and the result (for read) are the front-end types implied by the
instruction names.

The following examples illustrate the transfer of data between the CM and the front 
end. These simple procedures compute the send addresses of some specified number 
of processors (num_procs) on a specified axis of the current CM vp-set. They then read
the values from a source field in those processors and print the values on the front end. 


Example 22. Printing CM values on the front end: cm-print.c 


#include <stdio.h>
#include <cm/paris.h>

u_display_cm_memory( source, len, axis, num_procs, header_string )
CM_field_id_t source;
unsigned int  len, num_procs, axis;
char          *header_string;
{
    unsigned int  i;
    CM_sendaddr_t send_addr;

    printf( "%s\n", header_string );

    for ( i=0; i<num_procs; i++ ) {
        /* address of processor i on the given axis */
        send_addr =
            CM_fe_make_news_coordinate
                ( CM_vp_set_geometry( CM_current_vp_set ),
                  axis,
                  i );

        printf( " %d\n",
                CM_u_read_from_processor_1L( send_addr, source, len ));
    }
}

/* */

f_display_cm_memory( source, s_len, e_len, axis, num_procs,
                     header_string )
CM_field_id_t source;
unsigned int  s_len, e_len, num_procs, axis;
char          *header_string;
{
    unsigned int  i;
    CM_sendaddr_t send_addr;

    printf( "%s\n", header_string );

    for ( i=0; i<num_procs; i++ ) {
        /* address of processor i on the given axis */
        send_addr =
            CM_fe_make_news_coordinate
                ( CM_vp_set_geometry( CM_current_vp_set ),
                  axis,
                  i );

        printf( " %f\n",
                CM_f_read_from_processor_1L( send_addr,
                                             source,
                                             s_len, e_len ));
    }
}





Part IV 

Commands and Utilities 




Chapter 8 

Compiling and Executing Programs 


To experiment with the procedures in this chapter, use any of the C/Paris example 
programs shown in the previous chapters. Alternatively, a simple program that produces
some visible output is the following:


Example 23. A program with visible output: count-active-set.c


#include <cm/paris.h>
#include <stdio.h>

main()
{
    CM_init();
    CM_set_context();
    printf( "\nThe number of processors participating is %d.\n",
            CM_global_count_context() );
}


8.1 To Compile 

A C/Paris program is compiled with the front end’s C compiler in the same way as any 
C program that is to be linked with a specialized library (in this case, the Paris library). 
The program can be compiled on any VAX or Sun-4 that has CM System Software 
installed. 

In Version 5.0, the Paris library is specified on the cc command line as -lparis. (Please
consult the documentation for later versions to determine whether the means of
specifying the Paris library has changed.)

% cc count-active-set.c -lparis

Many Paris routines rely on definitions in the UNIX (serial) math library, but this
library is not prelinked with the Paris library. Therefore, C/Paris programmers should
specify the -lm option on the cc command line, placing it after -lparis.

% cc filename.c -lparis -lm


8.2 To Attach 

Before a C/Paris program can be executed, a front-end bus interface (or FEBI) must be 
attached to one or more sequencers within the CM. The CM System Software command
cmattach establishes this logical connection. It reserves for the program’s use
the processors that the sequencer controls and initializes (cold boots) the sequencer 
and its processors. 



Figure 27. A front-end FEBI attached to a CM sequencer 


Any number of users can develop and compile C/Paris programs at the same time on
the front end. However, the number of users that can execute programs at once is limited
by the number of FEBIs on the front end and the number of sequencers on the CM.
For example, Figure 28 shows a CM system with two FEBIs and four sequencers. One 
sequencer is free, but it cannot be accessed until one of the FEBIs becomes detached. 



Figure 28. Front-end FEBIs attached to CM sequencers 


The further action of cmattach, beyond attaching and initializing the CM processors, 
depends on whether the command is invoked in batch or interactive mode. If an executable
filename is supplied on the command line, cmattach executes the program in
batch (see next section); with no filename, cmattach prepares for interactive execution. 

When executed without options, cmattach attaches the first available FEBI to the first 
available sequencer. The on-line manual page for cmattach provides a full list of cmattach
options and their defaults and legal values. Some commonly used options are:

-w “Wait for resources.” The front end keeps trying to attach if no sequencer 
or no FEBI is available at the first try. This option is recommended when 
multiple users are competing for a share of CM hardware. For example: 


% cmattach -w my-program 

-S “Sequencer.” The front end attaches to a particular sequencer or to more
than one sequencer. A particular sequencer might be requested because it 
is associated with an optional floating-point accelerator or with a system 
I/O device. Multiple sequencers might be requested for programs with 
very large data sets. Some example command lines are: 

% cmattach -w -S1 my-program
% cmattach -w -S0-3 my-program 

-i “Interface.” The front end attaches a particular FEBI to the specified (or 
default) sequencer. This option is useful when the interfaces on a front end 
are physically connected to different CMs and a particular CM is desired. 
Some example command lines are: 

% cmattach -i0 my-program
% cmattach -w -i1 -S0-1 my-program


8.3 To Execute in Batch 

Paris programs can be executed in batch mode in the UNIX foreground or background 
or on a remote machine that is also a CM front end. 

In the UNIX Foreground 

To invoke cmattach in batch mode in the foreground, supply the executable filename 
on the command line: 

% cmattach [options] executable-filename [arguments] 

When invoked in this way, cmattach performs the following actions: 

1. Attaches a FEBI to a CM sequencer 

2. Cold boots the sequencer and its processors 

3. Executes the specified program 

4. Detaches the FEBI from the sequencer 

If no options beyond -w are specified, the screen display might look like the following
example. Note the printf output from the sample program count-active-set.c.

% cmattach -w count-active-set 

Attaching the Connection Machine system... 
coldbooting... done. 

Attached to 8192 physical processors 

The number of processors participating is 8192. 

Detaching... done. 

% 


In the UNIX Background 

A program can also be executed in the UNIX background: 

% cmattach [options] executable-filename [arguments] >& output-filename &

For example:

% cmattach -w -q my-program >& output & 

% 


In this example, program output and any error messages are redirected to the file
output. It is important to redirect both standard output and standard error, using >&; if
both streams are not redirected, the program could be suspended waiting to write to
the terminal. Note also the use of the option -q to suppress screen display of
informational messages arising from program execution.


On a Remote Machine 

Finally, a program can be executed in batch on a remote VAX or Sun-4 that is also a 
CM front end. This is done in the normal UNIX manner with the command rsh and the 
name of the remote machine. In this case, it is especially important to specify the full 
pathname of the executable file. 

% rsh machine-name cmattach [options] path/executable-filename [arguments] 

For example, the following command line causes my-program to be executed from the 
CM front end other-machine. The command line invokes pwd to find the path of a 
program that resides in the user’s present working directory: 

% rsh other-machine cmattach -w `pwd`/my-program 

Attaching the Connection Machine system... 
coldbooting... done. 

Attached to 8192 physical processors 

[program output, if any] 

Detaching... done. 

% 

8.4 To Execute Interactively 

If cmattach is invoked without an executable filename, it attaches and cold boots the 
hardware and then spawns a subshell in which programs may be executed. It does not 
detach the hardware until the subshell is exited, even if no program is executing. 

% cmattach [options] 

Attaching the Connection Machine system... 
coldbooting... done. 

Attached to 8192 physical processors 

Entering CMATTACH subshell. Type "exit" or control-D 
to detach the CM... 

% my-program-1 

[program output, if any] 

% my-program-2 

[program output, if any] 

% exit 

Detaching... done. 

% 


The main recommended use of the cmattach subshell is running shell-level utilities, 
such as the debugger dbx or the run-time safety checker, cmsetsafety. The procedure 
is described in Chapter 9. 




Chapter 9 

Programming Utilities 


This chapter describes three utilities that are useful in developing C/Paris programs: 

• The run-time safety utility 

• The debugger dbx 

• The Paris timer 


9.1 Run-Time Safety 

The run-time safety utility checks for certain errors and inconsistencies in user 
programs. When it detects a user error, the utility aborts program execution and prints 
information about the error. Expect reduced execution speed when safety checking is 
enabled; even so, it can be a helpful tool in program development and debugging. 

The utility has two states, on and off. When enabled, the utility checks the following 
conditions: 

• Whether the field-id’s passed as arguments to Paris instructions refer to fields 
in the current vp-set 

• Whether the field-id’s passed as arguments to Paris instructions are valid 
field-id’s (although not all invalid field-id’s are caught) 

• Whether the length specifiers passed to Paris instructions exceed the lengths of 
the respective field operands 

This utility is intended for use with C/Paris programs only. It is not recommended for 
use in run-time checking of compiler output of the higher level CM languages. 


From within a Program 

The safety utility is available as a Paris instruction, CM_set_safety_mode, which takes 
an unsigned integer argument. Any non-zero value enables the utility; a zero argument 
disables it. 


From the Shell 

The shell-level command cmsetsafety performs the same action as the Paris instruction 
CM_set_safety_mode. 

% cmsetsafety [on] [off] 

By using the command rather than the Paris instruction, we can execute a program 
either with or without safety checking without changing the source file. However, the 
shell command does not permit safety checking of only selected parts of a program. 

The command cmsetsafety is executed from within a cmattach subshell. For example: 

% cmattach 

Attaching the Connection Machine system... 
coldbooting... done. 

Attached to 8192 physical processors 

Entering CMATTACH subshell. Type "exit" or control-D 
to detach the CM... 

% cmsetsafety on 

% my-program 

[program output, if any] 

% cmsetsafety off 

% my-program 

[program output, if any] 

% exit 

% 


Safety is initially off in a subshell. If cmsetsafety is executed with the option on, all 
programs are then executed with safety on until the safety mode is changed or the 
subshell is exited. 


Changing Default Safety Behavior 

By default, all CM programs are executed with safety off, whether execution is 
interactive or in batch. To enable safety by default for all CM program execution, set the 
environment variable CM_DEFAULT_SAFETY to on or ON in the .cshrc file: 

setenv CM_DEFAULT_SAFETY on 

If the variable is not set, or if it is set to any other value, safety is off for batch execution 
and initially off in an interactive cmattach subshell. Safety can, of course, be enabled at 
any time within a subshell by invoking the command cmsetsafety, as noted above. 

It is often convenient to set the defaults such that safety is off for batch execution but 
on in an interactive cmattach subshell. In particular, safety should be enabled when 
using the interactive debugger dbx from within a subshell, as described in the next 
section. To have safety be initially on in a subshell but off for batch execution, add the 
following line to the .cshrc file: 

if ($?CMDEVICE) cmsetsafety on 


9.2 The Debugger dbx 

Like any C program, a C/Paris program can be debugged interactively by means of the 
debugger dbx. As noted in the previous section, it is strongly recommended that the 
Paris run-time safety utility be enabled whenever dbx is in use. 


Invoking the Debugger 

The debugger is activated from within a cmattach subshell. The procedure is: 

% cmattach 

Attaching the Connection Machine system... 
coldbooting... done. 

Attached to 8192 physical processors 

Entering CMATTACH subshell. Type "exit" or control-D 
to detach the CM... 

% dbx my-program 

dbx> [dbx commands such as run, s, and n] 
dbx> q 

% exit 

% 


Locating Paris Errors 

Locating an error in a C/Paris program is made somewhat difficult by the fact that the 
CM executes asynchronously with respect to the front end. After sending an instruction 
over the bus to the CM, the front end continues to execute other code. If the CM 
signals an error, it is not immediately obvious which Paris call is at fault: 

• Since the queue for the instruction bus can accommodate over 200 instructions, 
the CM may not execute an erroneous instruction until the front end is 
much farther along in the program. 

• CM error signals are not sent back to the front end as they occur. Instead, for 
reasons of system efficiency, the CM holds the error signals until the next time 
the front end reads data from the CM. 

The standard method for debugging programs is to insert breakpoints at various 
places before the point where the error is reported. For debugging a C/Paris program, 
we can force synchronization between the two machines at each breakpoint by 
interactively calling a Paris instruction that reads data from the CM. 
(CM_global_logior_context is commonly used for this purpose because it is fast and 
requires no arguments.) 

With each call to an instruction that reads CM data, any pending CM error messages 
are “piggybacked” to the front end. By using this method with a binary search strategy, 
we can usually isolate the offending Paris call quickly. 

Be aware that Paris functions are not linked into a program unless the program 
references them (directly or indirectly). To make sure that instructions used in debugging, 
such as CM_global_logior_context, are always linked, it is advisable to write a dummy 
function that calls all such instructions and place the dummy function in a separate 
file, say, debug.c. Then, link debug.o into any program that is still under development. 


Examining CM Data 

Paris does not extend dbx to support examining values on the CM. The user can write 
procedures that print CM values in any desired format. An example of such a 
procedure, one that calls CM_type_read_from_processor_1L to display the values in any 
specified number of processors, is shown in Chapter 7. 

All such debugging routines can be made available in a program by including them in 
the file debug.c, mentioned in the previous section, and linking the program with the 
associated object file. 


9.3 The Paris Timer 

The Paris timer is a facility for recording program execution time—both the total 
elapsed time and the time during which the CM was active. The timer consists of a set 
of Paris instructions; it can be used only from within a program. 

The Timing Functions 

The timing facility consists of three functions: 

CM_start_timer    begins accumulating timing information 

CM_stop_timer     stops accumulating timing information and records 
                  (optionally, prints) the information 

CM_reset_timer    erases accumulated timing information 

The information recorded is: 

• The CM’s active time (in seconds) since the call to CM_start_timer 

• Real time (in seconds) elapsed since the call to CM_start_timer 

• Percentage machine utilization, calculated as CM time divided by real time 

The functions CM_start_timer and CM_stop_timer can be inserted in a program to 
span the portion of the code for which timing information is desired. The program can 
also make successive calls to these two functions, much like starting and stopping a 
stopwatch. In this case, the information recorded is the cumulative time over the 
successive calls. To erase the accumulated time at any point, call CM_reset_timer. 

The exact behavior of the functions is as follows: 

CM_start_timer verbose 

Begins accumulating times. If the integer argument verbose is true (non-zero) 
and if there is a pause of 5 seconds or more while the timer is being calibrated, 
the instruction writes an explanation of the delay to the stream stderr. This 
message can appear only the first time the instruction is called in a program. 

CM_timeval_t *CM_stop_timer verbose 

Stops accumulating times and returns a pointer to a structure of type 
CM_timeval_t, where the updated times are stored. If the integer argument 
verbose is true (non-zero), the function writes the timing information to the 
stream stderr. 

The structure contains the members cmtv_real and cmtv_cm, both of type 
double. The structure is stored in static space; it must be copied if it is to be 
saved. 

CM_reset_timer 

Erases accumulated timing information. The function does not restart the 
timer. (It is not necessary to call this function before the first call to 
CM_start_timer.) 
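
Because the structure returned by CM_stop_timer lives in static space, a program that wants to compare several timings has to copy each result before stopping the timer again. The following is a minimal front-end sketch of that copy, plus the utilization calculation described above. It does not call Paris: timing_snapshot_t, save_timing, and utilization_percent are hypothetical names, and the struct is only a stand-in that mirrors the two members the text names (cmtv_real and cmtv_cm).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CM_timeval_t: only the two members named in the text.
   The real type comes from <cm/paris.h>; this mirror is an assumption
   made so the sketch compiles without the CM libraries. */
typedef struct {
    double cmtv_real;   /* elapsed real time, in seconds */
    double cmtv_cm;     /* CM active time, in seconds */
} timing_snapshot_t;

/* Copy the timer result out of static storage so that a later stop
   call cannot overwrite the saved values. */
timing_snapshot_t save_timing(const timing_snapshot_t *from_timer)
{
    assert(from_timer != NULL);
    return *from_timer;          /* struct assignment copies both members */
}

/* Percentage utilization: CM time divided by real time, times 100. */
double utilization_percent(const timing_snapshot_t *t)
{
    assert(t != NULL && t->cmtv_real > 0.0);
    return 100.0 * t->cmtv_cm / t->cmtv_real;
}
```

In a real program the pointer passed to save_timing would be the value returned by CM_stop_timer, cast or copied member by member as the actual CM_timeval_t declaration requires.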


Interpreting Timer Output 

To use the C/Paris timer effectively, it is helpful to understand something of how it is 
implemented. 

In the present release of Paris, the timer proceeds by counting the number of idle CM 
cycles during the code segment in question, rather than the number of cycles during 
which the CM is active. Idle cycles are those during which the CM sequencer is waiting 
to receive an instruction from the front end. 





Since idle cycles are all of the same length, the CM’s active time can be computed by: 

1. Measuring elapsed real time with the front end’s real-time clock 

2. Multiplying the number of CM idle cycles by the (constant) time per cycle 

3. Subtracting total CM idle time from elapsed real time 
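
The three steps above amount to a single subtraction. A minimal sketch of the arithmetic follows; cm_active_seconds and its parameter names are hypothetical and are not part of Paris, which performs this computation internally.

```c
#include <assert.h>

/* CM active time = elapsed real time - (idle cycles * seconds per cycle).
   real_seconds      : elapsed time from the front end's real-time clock
   idle_cycles       : number of CM cycles spent waiting for an instruction
   seconds_per_cycle : the (constant) duration of one CM cycle */
double cm_active_seconds(double real_seconds,
                         unsigned long idle_cycles,
                         double seconds_per_cycle)
{
    double idle_seconds;

    assert(real_seconds >= 0.0 && seconds_per_cycle > 0.0);
    idle_seconds = (double)idle_cycles * seconds_per_cycle;
    return real_seconds - idle_seconds;
}
```

Note that any error in real_seconds passes straight through to the result, which is why the resolution of the front-end clock matters in the discussion that follows.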

With UNIX front ends, two potential problems arise: 

• The UNIX real-time clock has lower resolution (on the order of 1 millisecond) 
than CM cycle time (on the order of 1 microsecond). Since the reported CM 
active time is computed directly from the real time as measured on the front 
end, distortions are introduced when timing code segments whose total 
elapsed time is under about 1 second. 

• UNIX machines typically have some degree of multiprocessing activity, even 
when only one user is logged in. The real time that the front end is measuring is 
not the virtual time of the Paris program’s process; instead, it includes the time 
consumed by other processes. 

Such “interference” from other processes can lead to timing variations on the 
order of 15 percent even when the load on the front end is relatively light. The 
longer the code segment being timed, the more likely it is that timing distortions 
will be introduced by other processes. 

The implementation of the Paris timer suggests some rules of thumb for using it most 
effectively: 

• Use a front end that is as unloaded as possible. 

• Select or manipulate the code segment being timed so that the elapsed time is 
between 1 and 5 seconds. 

• Run the code segment at least five times and use the minimum value reported. 
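
The last rule of thumb can be applied with a small helper that keeps the smallest of several measurements. This is only a sketch: minimum_time is a hypothetical name, and the array is assumed to hold the CM active times (in seconds) that the program copied out after each timed run.

```c
#include <assert.h>

/* Return the smallest of n measured times (n >= 1). The minimum is the
   run least disturbed by other front-end processes. */
double minimum_time(const double *times, int n)
{
    int i;
    double best;

    assert(times != 0 && n >= 1);
    best = times[0];
    for (i = 1; i < n; i++)
        if (times[i] < best)
            best = times[i];
    return best;
}
```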

Some caveats are also in order, both resulting from the timer’s definition of CM idle 
time. Idle time includes only those cycles during which the CM is waiting for an 
instruction from the front end. Consequently, CM active time includes not only those 
cycles during which the CM is performing computations, but also those during which 
the CM is waiting for arguments to an instruction it has received. 

The caveats are: 

• Expect slightly different CM active times on different front-end models for 
code segments that do not keep the CM 100 percent active. The time the CM 
spends waiting for data to appear is counted as active, but front-end models 
differ in the speed with which they can move data over the bus interface to the 
sequencer. 

• Avoid stopping a process that is being timed. If the process stops while the CM 
is waiting for an instruction, then all is well since the time spent stopped is 
subtracted from the total real time in computing CM active time. However, if 
the process stops while the CM is waiting for data, the time spent stopped is 
counted as active time, which artificially increases the CM execution time 
recorded. 



Part V 

Example Programs 


Appendix A 

Game of Life 


/* 

This program is a cellular automaton model of life and 
death that works with the following rules: 

The cells are set up on a 2-d grid. 

Each cell starts in a randomly chosen life or death 
state. 

Cells compute life and death at every time step n 
for the state at time step n+1. 

Each cell gathers information about its 8 adjacent 
neighbors: north, south, east, west, northwest, 
northeast, southwest, and southeast. 

Each cell uses its neighbor information to determine 
whether it will live or die, depending on the number 
of its neighbors that are alive. 

A cell "lives" if it is alive and has two or three live 
neighbors OR if it is dead and has three live 
neighbors. Otherwise it dies. 

*This parallel version was adapted from a program by 
Craig Reese, IDA/Supercomputing Research Center, 

*/ 


Example 24. Conway’s game of life: life.c 


#include <stdio.h> 
#include <cm/paris.h> 

#define PERCENT_ON 40 
#define NORTH 0, CM_upward 
#define SOUTH 0, CM_downward 
#define EAST 1, CM_upward 
#define WEST 1, CM_downward 
#define DEFAULT_GRID_SIZE 512 

int grid_size = DEFAULT_GRID_SIZE; 


/**********************************************************/ 
main(argc, argv) 
char *argv[]; 
{ 
    int generation; 
    CM_geometry_id_t life_2d_geometry; 
    CM_vp_set_id_t life_vp_set; 
    unsigned dimensions[2]; 
    CM_field_id_t 
        alive, 
        my_sum, neighbor_sum, 
        temp; 

/**********************************************************/ 

    if (argc-1) 
        sscanf(argv[1], "%d", &grid_size); 

    printf("Warm booting the CM ..."); fflush(stdout); 

    CM_init(); 
    printf("Done\n"); 

/* Set up the vp-set in which the game of life will be 
   played */ 

    dimensions[0] = grid_size; 
    dimensions[1] = grid_size; 

    life_2d_geometry = CM_create_geometry(dimensions, 2); 
    life_vp_set = CM_allocate_vp_set(life_2d_geometry); 
    CM_set_vp_set(life_vp_set); 

/**********************************************************/ 
/* allocate storage */ 

    /* set up the storage in each cell VP */ 
    alive = CM_allocate_heap_field(1); 
    my_sum = CM_allocate_heap_field(4); 
    neighbor_sum = CM_allocate_heap_field(4); 
    temp = CM_allocate_heap_field(4); 

/**********************************************************/ 
/* Randomly choose an initial state */ 

    CM_set_context(); 
    CM_u_random_1L(alive, 1, 2); 

/**********************************************************/ 
/* The Main loop */ 

    for (generation = 0; generation < 1000; generation++) { 

        /* start with everybody */ 
        CM_set_context(); 

        /* initialize fields */ 
        CM_u_move_zero_1L(my_sum, 4); 
        CM_u_move_zero_1L(neighbor_sum, 4); 
        CM_u_move_zero_1L(temp, 4); 

        /* N neighbor */ 
        CM_get_from_news_1L(temp, alive, NORTH, 1); 
        CM_u_add_3_3L(my_sum, temp, alive, 4, 4, 1); 

        /* S neighbor */ 
        CM_get_from_news_1L(temp, alive, SOUTH, 1); 
        CM_u_add_2_1L(my_sum, temp, 2); 

        /* Share results so far. 

           Notice that after this call, each cell's east 
           neighbor has life information about the cell's 
           northeast and southeast neighbors encoded in 
           neighbor_sum. The same is true for the west 
           neighbor with respect to each cell's northwest and 
           southwest neighbors. 
        */ 
        CM_u_move_1L(neighbor_sum, my_sum, 2); 

        /* E neighbor */ 
        CM_get_from_news_1L(temp, neighbor_sum, EAST, 2); 
        CM_u_add_2_1L(my_sum, temp, 3); 

        /* W neighbor */ 
        CM_get_from_news_1L(temp, neighbor_sum, WEST, 2); 
        CM_u_add_2_1L(my_sum, temp, 4); 

        /* remove the cell's own state from the sum */ 
        CM_u_subtract_3_3L(my_sum, my_sum, alive, 4, 4, 1); 

        /* a live cell with two live neighbors, or any cell */ 
        /* with three live neighbors, lives; everyone else dies */ 
        CM_u_move_zero_1L(temp, 4); 
        CM_load_context(alive); 
        CM_u_eq_constant_1L(my_sum, 2, 4); 
        CM_store_test(temp); 

        /* test for three live neighbors */ 
        /* inclusive or the result with alive and */ 
        /* place back into alive */ 
        CM_set_context(); 
        CM_u_eq_constant_1L(my_sum, 3, 4); 
        CM_logior_test(temp); 
        CM_store_test(alive); 

        /* now only those processors who have alive set */ 
        /* survived this generation. */ 
        CM_load_context(alive); 

        /* output of the grid to your favorite display goes here */ 
    } 
} 
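
For checking results on a small grid, it can help to have a serial version of the same update running on the front end alone. The sketch below follows the listing's two-pass neighbor count (sum along axis 0 first, then combine the partial sums of the axis-1 neighbors and subtract the cell's own state). It is not part of the original program: life_step and N are hypothetical names, the grid is held in an ordinary C array, cells are assumed to hold only 0 or 1, and the edges are assumed to wrap around, which is how NEWS neighbors are taken to behave here.

```c
#include <assert.h>

#define N 5   /* grid edge for this sketch; the CM listing defaults to 512 */

/* One generation on an N x N grid whose edges wrap around.
   Axis 0 is the first array index, axis 1 the second. */
void life_step(unsigned char alive[N][N])
{
    unsigned char column_sum[N][N];   /* self + both axis-0 neighbors */
    unsigned char next[N][N];
    int i, j;

    assert(N >= 3);

    /* Pass 1: each cell adds itself and its two neighbors along axis 0. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            column_sum[i][j] = alive[(i + N - 1) % N][j]
                             + alive[i][j]
                             + alive[(i + 1) % N][j];

    /* Pass 2: add the partial sums of the two axis-1 neighbors, then
       remove the cell's own state to leave the count of its 8 neighbors. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            int sum = column_sum[i][(j + N - 1) % N]
                    + column_sum[i][j]
                    + column_sum[i][(j + 1) % N]
                    - alive[i][j];
            /* same test as the listing: (alive and sum == 2) or sum == 3 */
            next[i][j] = (alive[i][j] && sum == 2) || sum == 3;
        }

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            alive[i][j] = next[i][j];
}
```

The two-pass form needs only four neighbor fetches per cell instead of eight, which is the reason the CM listing moves my_sum into neighbor_sum before fetching from the east and west.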



Appendix B 

Include File: Macros and Constants 


/* 

This header file defines macros and constants that will 
be used throughout programming examples in the appendixes 
to "Introduction to Programming in C/Paris." Please 
note that these example macros are not supported as part 
of the CM System Software and Thinking Machines 
Corporation does not warrant them as such. 

*/ 


Example 25. A file included in later examples: macros-and-constants.h 


/* Constant Macros */ 

#define SLEN 23 
#define ELEN 8 
#define IEEE_TOTAL_LENGTH 32 
#define FLEN IEEE_TOTAL_LENGTH 
#define FLENS SLEN,ELEN 

/* Macros with arguments */ 

#define MIN(a, b) (((a)<(b))?(a):(b)) 
#define MAX(a, b) (((a)>(b))?(a):(b)) 
#define POWER_OF_TWO(a) (1<<(MAX((a),0))) 

/**********************************************************/ 
/* Macros that map front-end types onto the CM */ 

/* returns the length of a type in bits */ 
#define CM_TYPELEN(type) (unsigned)(8 * sizeof(type)) 

/* allocates enough room on the CM stack for a type */ 
#define CM_ALLOCATE_TYPE_ON_STACK(type) \ 
    CM_allocate_stack_field(CM_TYPELEN(type)) 

/* allocates enough room on the CM heap for a type */ 
#define CM_ALLOCATE_TYPE_ON_HEAP(type) \ 
    CM_allocate_heap_field(CM_TYPELEN(type)) 

/* this expression comes from the example idioms of */ 
/* the ANSI C draft for determining the offset of a */ 
/* slot in a struct */ 
#define STRUCT_SLOT_OFFSET(type,slotname) \ 
    ((unsigned)&(((type *)0)->slotname)) 

/* determines the offset, in bits, of a struct slot type */ 
#define CM_STRUCT_SLOT_OFFSET(type,slotname) \ 
    (8 * STRUCT_SLOT_OFFSET(type,slotname)) 

/* returns the CM_field_id_t of the subfield slotname */ 
#define CM_STRUCT_SUBFIELD(obj,type,slotname) \ 
    (CM_field_id_t) CM_add_offset_to_field_id \ 
    ((obj),CM_STRUCT_SLOT_OFFSET(type,slotname)) 
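
The slot-offset macros above map a member of a front-end struct to a bit offset inside a CM field. On a compiler that provides the standard header <stddef.h>, the same quantity can be computed with offsetof instead of the null-pointer cast. The sketch below shows the idea on the front end only; struct particle_record, TYPE_BITS, MEMBER_BIT_OFFSET, and flags_bit_offset are hypothetical names introduced for illustration, and CHAR_BIT is used where the listing hard-codes 8.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */
#include <limits.h>   /* CHAR_BIT */

/* Hypothetical front-end record used only for this illustration. */
struct particle_record {
    unsigned char color;        /* byte offset 0 */
    unsigned char flags[3];     /* byte offset 1 */
};

/* Length of a type in bits, like the listing's type-length macro. */
#define TYPE_BITS(type) ((unsigned)(CHAR_BIT * sizeof(type)))

/* Bit offset of a member, using the standard offsetof macro in place
   of the null-pointer cast idiom. */
#define MEMBER_BIT_OFFSET(type, member) \
    ((unsigned)(CHAR_BIT * offsetof(type, member)))

/* Bit offset of the flags member within the record. */
unsigned flags_bit_offset(void)
{
    unsigned off = MEMBER_BIT_OFFSET(struct particle_record, flags);

    assert(off < TYPE_BITS(struct particle_record));
    return off;
}
```

A bit offset computed this way is the kind of value the listing passes to CM_add_offset_to_field_id to obtain the field-id of a subfield.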



Appendix C 

Transposing an Array 


/* 

This program transposes a 2-D matrix that is stored in 
the Connection Machine across the processors. The 
storage is such that each virtual processor has one 
element of the array. After some initial set-up and 
send-address calculation, the send instruction actually 
moves each datum to its transposed location. 

*/ 


Example 26. Transposing an array: transpose.c 


#include <stdio.h> 
#include <cm/paris.h> 

#define FIELD_LENGTH 8 

/**********************************************************/ 
main(argc, argv) 
char *argv[]; 
{ 
    CM_field_id_t 
        matrix_value, temp, send_address, x_news, y_news; 
    CM_geometry_id_t matrix_2d_geometry; 
    CM_vp_set_id_t matrix_vp_set; 
    unsigned 
        matrix_dimensions[2], 
        send_address_length, 
        news_coordinate_length, 
        row, column; 
    CM_sendaddr_t fe_send_address; 

/**********************************************************/ 

    printf("Warm booting the CM ..."); fflush(stdout); 
    CM_init(); 
    printf("Done\n"); 

/**********************************************************/ 
/* Set up geometry and vp-set of matrix */ 

    matrix_dimensions[0] = 256; 
    matrix_dimensions[1] = 256; 
    matrix_2d_geometry = 
        CM_create_geometry(matrix_dimensions, 2); 
    matrix_vp_set = CM_allocate_vp_set(matrix_2d_geometry); 
    CM_set_vp_set(matrix_vp_set); 

/* Allocate storage */ 

    /* activate all processors */ 
    CM_set_context(); 

    /* Determine how much storage is needed for the send 
       address */ 
    send_address_length = 
        CM_geometry_send_address_length(matrix_2d_geometry); 

    /* Determine how much storage is needed for the news 
       coordinates. We simplify matters by having only one 
       length because we know we are dealing with a square 
       matrix. In general, we need one length value per 
       axis */ 

    news_coordinate_length = 
        CM_geometry_coordinate_length(matrix_2d_geometry, 0); 

    matrix_value = 
        CM_allocate_heap_field(FIELD_LENGTH * 2 + 
                               send_address_length + 
                               news_coordinate_length * 2); 
    temp = 
        CM_add_offset_to_field_id(matrix_value, FIELD_LENGTH); 
    send_address = 
        CM_add_offset_to_field_id(matrix_value, 
                                  FIELD_LENGTH*2); 
    x_news = CM_add_offset_to_field_id(send_address, 
                                       send_address_length); 
    y_news = 
        CM_add_offset_to_field_id(x_news, 
                                  news_coordinate_length); 

/**********************************************************/ 

    /* Initialize fields */ 
    CM_u_move_zero_1L(matrix_value, 
                      FIELD_LENGTH * 2 + 
                      send_address_length + 
                      news_coordinate_length * 2); 

    CM_my_news_coordinate_1L(x_news, 0, 
                             news_coordinate_length); 
    CM_my_news_coordinate_1L(y_news, 1, 
                             news_coordinate_length); 

    /* Seed the triangle where x > y with 1 */ 
    CM_u_gt_1L(x_news, y_news, news_coordinate_length); 
    CM_logand_context_with_test(); 
    CM_u_move_constant_1L(matrix_value, 1, FIELD_LENGTH); 

    /* reset processor mask */ 
    CM_set_context(); 

    /* Seed the triangle where x < y with 255 */ 
    CM_u_gt_1L(y_news, x_news, news_coordinate_length); 
    CM_logand_context_with_test(); 
    CM_u_move_constant_1L(matrix_value, 255, FIELD_LENGTH); 

    /* reset processor mask */ 

    CM_set_context(); 

/**********************************************************/ 
/* Display via text the upper portion of the matrix */ 

    printf("A 10x10 region of the matrix before transpose\n"); 
    for (row=0; row < 10; row++) { 
        for (column=0; column < 10; column++) { 
            fe_send_address = 
                CM_fe_make_news_coordinate(matrix_2d_geometry, 
                                           0, /* axis */ 
                                           column /* news coord */ 
                                           ); 
            fe_send_address = 
                CM_fe_deposit_news_coordinate(matrix_2d_geometry, 
                                              fe_send_address, 
                                              1, /* axis */ 
                                              row /* news coord */); 
            printf("%5d", 
                   CM_u_read_from_processor_1L 
                   (fe_send_address, matrix_value, 
                    FIELD_LENGTH)); 
        } 
        printf("\n"); 
    } 
    printf("\n"); 

/**********************************************************/ 
/* Build send address 

   Notice that to make the address the transpose address 
   we need only to switch the news coordinates in the 
   send address. The part of the send address that is 
   normally for x (in this example) is being filled by 
   y_news, namely axis 0. The same is done for the normal 
   place for the y coordinate: it is being filled with 
   the value of x_news */ 

    CM_make_news_coordinate_1L(matrix_2d_geometry, 
                               send_address, 0, y_news, 
                               news_coordinate_length); 

    CM_deposit_news_coordinate_1L(matrix_2d_geometry, 
                                  send_address, 1, x_news, 
                                  news_coordinate_length); 

/**********************************************************/ 
/* We must do the send to the temp field because it is 
   illegal to do a send from and to the same field. We put 
   temp into matrix_value after the send. */ 

    CM_send_with_overwrite_1L(temp, send_address, matrix_value, 
                              FIELD_LENGTH, CM_no_field); 
    CM_u_move_1L(matrix_value, temp, FIELD_LENGTH); 

/**********************************************************/ 
/* Display via text the upper portion of the matrix */ 

    printf("A 10x10 region of the matrix after transpose \n"); 
    for (row=0; row < 10; row++) { 
        for (column=0; column < 10; column++) { 
            fe_send_address = 
                CM_fe_make_news_coordinate(matrix_2d_geometry, 
                                           0, /* axis */ 
                                           column /* news coord */ 
                                           ); 
            fe_send_address = 
                CM_fe_deposit_news_coordinate(matrix_2d_geometry, 
                                              fe_send_address, 
                                              1, /* axis */ 
                                              row /* news coord */); 
            printf("%5d", 
                   CM_u_read_from_processor_1L 
                   (fe_send_address, matrix_value, 
                    FIELD_LENGTH)); 
        } 
        printf("\n"); 
    } 
    printf("\n"); 

    /* Deallocate the matrix fields, the vp-set, and geometry */ 
    CM_deallocate_heap_field(matrix_value); 
    CM_deallocate_vp_set(matrix_vp_set); 
    CM_deallocate_geometry(matrix_2d_geometry); 
} 
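
The effect of the send can be reproduced serially to make the address swap concrete. The following sketch is not part of the original listing: transpose_by_send and DIM are hypothetical names, the matrix is an ordinary front-end array, and the loop body plays the role of every processor sending its value to the processor whose coordinates are its own, swapped.

```c
#include <assert.h>

#define DIM 4   /* small square matrix for the sketch; the listing uses 256 */

/* Each element at coordinates (x, y) is "sent" to (y, x). A separate
   temporary array receives the data, mirroring the listing's use of a
   temp field because the source and destination of a send must differ. */
void transpose_by_send(unsigned char matrix[DIM][DIM])
{
    unsigned char temp[DIM][DIM];
    int x, y;

    assert(matrix != 0);

    /* the "send": destination coordinates are the source's, swapped */
    for (x = 0; x < DIM; x++)
        for (y = 0; y < DIM; y++)
            temp[y][x] = matrix[x][y];

    /* move the received values back into the original array */
    for (x = 0; x < DIM; x++)
        for (y = 0; y < DIM; y++)
            matrix[x][y] = temp[x][y];
}
```

Seeding the array the way the listing does (1 where x > y, 255 where x < y, 0 on the diagonal) makes the result easy to check by eye: the two triangles trade places.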
Appendix D 

Computing a Histogram 


/* 

This file contains two functions. 

The first function, accumulate_votes, computes the 
histogram of a field that is part of a 2-d vp_set. It 
uses a send with add to do this after it has calculated 
the send address of the destination. 

The second function sets up the 2-d field to be 
histogrammed. To demonstrate one function of the 
histogram, we use a field in which every processor has 
the same value. This will result in a histogram with 
all the votes in one cell. 

To demonstrate the use of two differently shaped vp_sets, 
we have a vp_set for the image (2-d field) and a vp_set 
for the histogram. The method of communication between 
the two is CM_send_with_u_add_1L. 

*/ 


Example 27. Computing a histogram: histogram.c 


#include <stdio.h> 
#include <cm/paris.h> 
#include "macros-and-constants.h" 

#define TARGET_VALUE 0 

/*========================================================*/ 

accumulate_votes(accumulator_field, src_field, 
                 accumulator_length, src_length) 
CM_field_id_t accumulator_field, src_field; 
unsigned accumulator_length, src_length; 
{ 

    CM_geometry_id_t dest_geometry; 
    CM_field_id_t send_address, vote; 
    unsigned send_address_length; 

/*========================================================*/ 
/* Garner information about what is passed in to be used 
   later */ 

    dest_geometry = 
        CM_vp_set_geometry(CM_field_vp_set(accumulator_field)); 

    send_address_length = 
        CM_geometry_send_address_length(dest_geometry); 
    CM_set_vp_set(CM_field_vp_set(src_field)); 

    /* Allocate temporary fields */ 

    vote = 
        CM_allocate_stack_field(accumulator_length + 
                                send_address_length); 

    send_address = 
        CM_add_offset_to_field_id(vote, accumulator_length); 

    /* Initialize values for the temp fields */ 
    /* 
       We can set vote to one and zero out send_address at the 
       same time by setting vote to one and specifying a 
       length argument that is the sum of accumulator_length 
       and send_address_length. This is possible only 
       because these fields, by virtue of their allocation 
       using field offsets, are contiguous neighbors inside 
       a larger field. 
    */ 
    CM_u_move_constant_1L(vote, 1, accumulator_length + 
                          send_address_length); 


/*========================================================*/ 
/* Calculate the send address of the votes. 

   We use the src_field as a basis from which to construct 
   an address into another (the histogram) vp_set. The 
   histogram vp_set is, by definition, 1-d. Hence, there is 
   only one component in the construction of the address. 
*/ 

    CM_make_news_coordinate_1L(dest_geometry, 
                               send_address, 
                               /* axis = */ 0, 
                               /* coord = */ src_field, 
                               src_length); 

/* Send the votes to their respective accumulator cells. 

   The CM_no_field indicates that we do not want the send 
   mechanism to waste time notifying us when a message (in 
   this case a vote) cannot be delivered due to collision. 
   The use of CM_send_with_u_add_1L explicitly indicates the 
   method by which collisions are handled. If we used a 
   CM_send_1L (the generic form) we would want to be 
   notified in a field when our messages didn't make it. 
*/ 

    CM_send_with_u_add_1L(accumulator_field, 
                          send_address, 
                          vote, 
                          accumulator_length, 
                          CM_no_field); 

    /* Deallocate temp fields */ 
    CM_deallocate_stack_through(vote); 
} 

/* The function main does little more than set up image 
   and histogram vp_sets and call accumulate_votes. 
*/ 

main() 
{ 
    unsigned int image_dimensions[2]; 
    CM_vp_set_id_t image_vp_set; 
    CM_geometry_id_t image_geometry; 
    CM_field_id_t image; 
    unsigned int coordinate_length, image_length; 

    unsigned int accumulator_dimensions[1]; 
    CM_vp_set_id_t accumulator_vp_set; 
    CM_geometry_id_t accumulator_geometry; 
    CM_field_id_t tally; 
    unsigned int tally_length; 

    CM_sendaddr_t send_address; 

    printf("Warm booting the CM ..."); fflush(stdout); 
    CM_init(); 
    printf("Done\n"); 

/*========================================================*/ 

    /* allocate the image vp-set and fields */ 
    image_dimensions[0] = 256; 
    image_dimensions[1] = 256; 
    image_length = 8; 
    coordinate_length = 16; 
    image_geometry = CM_create_geometry(image_dimensions, 2); 
    image_vp_set = CM_allocate_vp_set(image_geometry); 
    CM_set_vp_set(image_vp_set); 
    image = CM_allocate_heap_field(image_length); 
    CM_u_move_zero_1L(image, image_length); 

/*========================================================*/ 

    /* allocate the accumulator vp-set and fields */ 
    accumulator_dimensions[0] = CM_physical_processors_limit; 
    tally_length = 32; 
    accumulator_geometry = 
        CM_create_geometry(accumulator_dimensions, 1); 
    accumulator_vp_set = 
        CM_allocate_vp_set(accumulator_geometry); 
    CM_set_vp_set(accumulator_vp_set); 
    tally = CM_allocate_heap_field(tally_length); 
    CM_u_move_zero_1L(tally, tally_length); 

/*========================================================*/ 
/* Main section */ 

    /* note we are in the image_vp_set for accumulation */ 
    CM_set_vp_set(image_vp_set); 
    CM_set_context(); 

    /* set image to a constant value */ 
    CM_u_move_constant_1L(image, TARGET_VALUE, image_length); 

    accumulate_votes(tally,        /* accumulator_field */ 
                     image,        /* src_field */ 
                     tally_length, /* accumulator_length */ 
                     image_length  /* src_length */ 
                     ); 

    /* now we switch to the accumulator_vp_set */ 
    CM_set_vp_set(accumulator_vp_set); 

    send_address = 
        CM_fe_make_news_coordinate(accumulator_geometry, 
                                   /* axis = */ 0, 
                                   /* coord = */ TARGET_VALUE); 

    printf("Total in proc[%d] = %d: Total image size %d\n", 
           TARGET_VALUE, 
           CM_u_read_from_processor_1L(send_address, 
                                       tally, 
                                       tally_length), 
           image_dimensions[0]*image_dimensions[1]); 
} 



Appendix E 

Particle Problem 


This file contains a collection of functions that model 
a very simple Newtonian physics particle system in 
parallel. The purpose is to demonstrate how to set up and 
use more than one vp-set on the Connection Machine using 
C/Paris. It also demonstrates a method for generating 
send addresses from computed data. 

The physics of this toy problem is the ultra-simple 
portion. What we loosely attempt to model is the 
behavior of particles under the laws of Newton. We 
roughly approximate the interaction of acceleration and 
velocity on the position of a particle over time. We 
divide time up into discrete steps and answer the 
question, What happens to each particle in this time 
step? 


The implementation of this particle system consists 
primarily of two entities: a vp-set to hold the 
information about the particles and a vp-set to hold the 
display of those particles. 


The particle vp-set has each processor contain all the
information about one particle. This information
consists of position, velocity, and acceleration vectors
and a color. The particles are initialized with
constrained random values. The particle vectors are then
updated according to some arbitrary rules (roughly
Newtonian; see update_part for details) for every time
step.

The image vp-set, which holds the plotted positions of
each particle, can be used to display the entire set of
particles at every time step.

The event loop is as follows:

- initialize particles
- for every time step
    - draw current position of the particles
    - update particle information
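
The event loop can also be traced on an ordinary workstation. The following is a minimal serial sketch, not the C/Paris program itself. It assumes a hypothetical SerialParticle structure and hypothetical helper names (seed_particle, update_particle, run_particles), uses rand() as a stand-in for the CM random-number instruction, follows the color++ rule given in the update_part comment, and omits the drawing step. Each inner loop over particles corresponds to what a single parallel instruction does across all processors on the CM.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical serial stand-in for the data held by one processor. */
typedef struct {
    unsigned char color;      /* age indicator */
    float x, y;               /* position */
    float vx, vy;             /* velocity */
    float ax, ay;             /* acceleration */
} SerialParticle;

/* Value in [-0.5, 0.5) scaled by range; rand() is an assumed
   replacement for the CM random-number instruction. */
static float centered_random(float range)
{
    double unit = (double)rand() / ((double)RAND_MAX + 1.0);
    return (float)((unit - 0.5) * (double)range);
}

/* Put one particle in the middle of the image with bounded
   random velocity and acceleration, and color 1. */
void seed_particle(SerialParticle *p, int edge)
{
    assert(edge > 0);
    p->x = p->y = (float)(edge >> 1);
    p->vx = centered_random((float)(edge >> 5));
    p->vy = centered_random((float)(edge >> 5));
    p->ax = centered_random((float)(edge >> 7));
    p->ay = centered_random((float)(edge >> 7));
    p->color = 1;
}

/* Advance one particle by one time step (t = 1) and reseed it
   if it has left the image. */
void update_particle(SerialParticle *p, int edge)
{
    p->x += p->vx;
    p->y += p->vy;
    p->vx += p->ax;
    p->vy += p->ay;
    p->ax *= 0.66f;
    p->ay *= 0.66f;
    p->color++;
    if (p->x < 0.0f || p->y < 0.0f ||
        p->x > (float)edge || p->y > (float)edge)
        seed_particle(p, edge);
}

/* The event loop: initialize, then update at every time step.
   Drawing is omitted in this sketch. */
void run_particles(SerialParticle *parts, int n, int edge, int steps)
{
    int i, t;
    for (i = 0; i < n; i++)
        seed_particle(&parts[i], edge);
    for (t = 0; t < steps; t++)
        for (i = 0; i < n; i++)
            update_particle(&parts[i], edge);
}
```

Calling run_particles(parts, n, 1024, 2000) mirrors the defaults used in the example. The conditional reseed inside update_particle plays the role that the context flag plays in the parallel version.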


Example 28. A Newtonian particle problem: cm-particles.c 


#include <stdio.h> 

#include <cm/paris.h> 

#include "macros-and-constants.h" 


/**********************************************************/ 
/* Constant Definitions */ 

#define NUM_TIME_STEPS 2000

#define DEFAULT_SCREEN 1024 

/**********************************************************/
/* 

This structure is mapped into the memory of every 
processor that contains a particle. Throughout this 
file we will use accessor functions found in the header 
file "macros-and-constants.h". For example, 

CM_STRUCT_SUBFIELD(particles, Particle, x)

returns the field-id for the slot 'x' in the particles
field.

*/

typedef struct {
    char
        color;
    float
        x, y,      /* position */
        vx, vy,    /* velocity */
        ax, ay;    /* acceleration */
} Particle;

/* the size of one side of the 2-d image. */
int image_edge_size = DEFAULT_SCREEN;
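
As a rough illustration of what an accessor such as CM_STRUCT_SUBFIELD has to work out, the sketch below derives the bit offset and bit length of a structure slot using only standard C. The names SLOT_BIT_OFFSET, SLOT_BIT_LENGTH, slot_fits, and DemoParticle are hypothetical, and the idea that the CM field layout mirrors the front-end structure layout is an assumption made for this sketch. The actual definitions are the ones in macros-and-constants.h.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the Particle structure. */
typedef struct {
    char  color;
    float x, y;
} DemoParticle;

/* Bit offset of a slot from the start of the structure
   (assumes the CM field uses the same layout). */
#define SLOT_BIT_OFFSET(type, slot) \
    ((unsigned)(offsetof(type, slot) * CHAR_BIT))

/* Bit length of a slot. */
#define SLOT_BIT_LENGTH(type, slot) \
    ((unsigned)(sizeof(((type *)0)->slot) * CHAR_BIT))

/* Returns 1 when the slot [off, off + len) lies inside a field
   that is total_bits long. */
int slot_fits(unsigned off, unsigned len, unsigned total_bits)
{
    assert(len > 0);
    return off + len <= total_bits;
}
```

With an offset and a length in hand, a subfield id can be formed by adding the offset to the id of the enclosing field, which is how the listings in this appendix use CM_add_offset_to_field_id.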

/**********************************************************/ 

/**********************************************************/ 

/* 

This function takes the current position and color of 
each particle (in parallel, of course) and plots each 
particle's color at the x and y location specified. A 
send is used after a send address is generated from the 
current location information. 

*/ 


void 

draw_particles(image, particles) 

CM_field_id_t image, particles; 

{ 

CM_field_id_t sx, sy;
unsigned len;

CM_geometry_id_t geometry;


/* figure out the size of the coordinate field */
/* notice we add 1 for the sign bit */
geometry = CM_vp_set_geometry(CM_field_vp_set(image));
len = MAX(CM_geometry_coordinate_length(geometry, 0),
          CM_geometry_coordinate_length(geometry, 1)) + 1;

/*

Allocate temp fields to hold the integer values of the
current x and y locations

*/

sx = CM_allocate_stack_field(2 * len);
sy = CM_add_offset_to_field_id(sx, len);
CM_u_move_zero_1L(sx, 2*len);

/**********************************************************/ 
/* 

Convert the floating-point values to signed integers. 

They are needed in this form for the next call. 

*/ 

CM_s_f_floor_2_2L(sx,
                  CM_STRUCT_SUBFIELD(particles,
                                     Particle, x),
                  len, FLENS);

CM_s_f_floor_2_2L(sy,
                  CM_STRUCT_SUBFIELD(particles,
                                     Particle, y),
                  len, FLENS);

/**********************************************************/
/* Call draw_point.

The source code for draw_point resides in draw-point.c
(on-line) and is described in Chapter 7 of this manual.

*/

draw_point(image, sx, sy,
           CM_STRUCT_SUBFIELD(particles,
                              Particle, color),
           len, CM_TYPELEN(char));

}


/**********************************************************/ 

/* 

This function sets up the initial state of the particles.
All it does is put each particle in the middle of the
screen (position) and specify that the velocity will be
between +- image_edge_size/32, the acceleration will
be between +- image_edge_size/128, and the color will be 1.

*/ 

void 

seed_parts(particles) 

CM_field_id_t particles; 

{ 

CM_field_id_t 

tx, ty, tvx, tvy, tax, tay, tcolor; 


/* 

Make simple references to each subfield. 

*/ 

tx = CM_STRUCT_SUBFIELD(particles,Particle,x); 

ty = CM_STRUCT_SUBFIELD(particles,Particle,y); 

tvx = CM_STRUCT_SUBFIELD(particles,Particle,vx); 

tvy = CM_STRUCT_SUBFIELD(particles,Particle,vy); 

tax = CM_STRUCT_SUBFIELD(particles,Particle,ax); 

tay = CM_STRUCT_SUBFIELD(particles,Particle,ay);

tcolor = CM_STRUCT_SUBFIELD(particles,Particle,color);

/* initialize start location to center of screen */
CM_f_move_constant_1L(tx, (double)(image_edge_size>>1),
                      FLENS);

CM_f_move_constant_1L(ty, (double)(image_edge_size>>1),
                      FLENS);


/**********************************************************/ 
/* the velocity will be between +- image_edge_size/32 */ 
CM_f_random_1L(tvx, FLENS);
CM_f_subtract_constant_2_1L(tvx, (double) 0.5,
                            FLENS);
CM_f_multiply_constant_2_1L(tvx,
                            (double)(image_edge_size>>5),
                            FLENS);

CM_f_random_1L(tvy, FLENS);
CM_f_subtract_constant_2_1L(tvy, (double) 0.5,
                            FLENS);
CM_f_multiply_constant_2_1L(tvy,
                            (double)(image_edge_size>>5),
                            FLENS);

/**********************************************************/

/* the acceleration will be between
+- image_edge_size/128 */

CM_f_random_1L(tax, FLENS);
CM_f_subtract_constant_2_1L(tax, (double) 0.5,
                            FLENS);
CM_f_multiply_constant_2_1L(tax,
                            (double)(image_edge_size>>7),
                            FLENS);

CM_f_random_1L(tay, FLENS);
CM_f_subtract_constant_2_1L(tay, (double) 0.5,
                            FLENS);
CM_f_multiply_constant_2_1L(tay,
                            (double)(image_edge_size>>7),
                            FLENS);


/**********************************************************/

CM_u_move_constant_1L(tcolor, 1, CM_TYPELEN(char));

}

/**********************************************************/

/**********************************************************/ 

/* 

This function updates the particles as follows:

x = x + vx       x(n) is vx(n-1)*t + x(n-1) where t=1
y = y + vy       ditto

vx = vx + ax     vx(n) is ax(n-1)*t + vx(n-1) where t=1
vy = vy + ay     ditto

ax = ax*0.66     arbitrarily scale back the acceleration.
ay = ay*0.66     ditto

color++;         color indicates age

For every particle that has gone off the edge of the image,
the context is set and seed_parts is called.


*/ 

void 

update_part(particles) 

CM_field_id_t particles; 

{ 

CM_field_id_t
tx, ty, tvx, tvy, tax, tay, tcolor,
saved_context, reseed;


/* make simple references to each subfield. */
tx = CM_STRUCT_SUBFIELD(particles,Particle,x);
ty = CM_STRUCT_SUBFIELD(particles,Particle,y);
tvx = CM_STRUCT_SUBFIELD(particles,Particle,vx);
tvy = CM_STRUCT_SUBFIELD(particles,Particle,vy);
tax = CM_STRUCT_SUBFIELD(particles,Particle,ax);
tay = CM_STRUCT_SUBFIELD(particles,Particle,ay);
tcolor = CM_STRUCT_SUBFIELD(particles,Particle,color);


/* update components */

CM_f_add_2_1L(tx, tvx, FLENS);
CM_f_add_2_1L(ty, tvy, FLENS);
CM_f_add_2_1L(tvx, tax, FLENS);
CM_f_add_2_1L(tvy, tay, FLENS);
CM_f_multiply_constant_2_1L(tax, (double) 0.66,
                            FLENS);
CM_f_multiply_constant_2_1L(tay, (double) 0.66,
                            FLENS);

CM_u_add_constant_2_1L(tcolor, 16, CM_TYPELEN(char));


/**********************************************************/ 
/* check to see if any particles are out of bounds */
/* if they are, reseed them and proceed */
saved_context = CM_allocate_stack_field(2);
reseed = CM_add_offset_to_field_id(saved_context, 1);
CM_u_move_zero_1L(saved_context, 2);

CM_store_context(saved_context);

CM_f_lt_constant_1L(tx, (double)0, FLENS);
CM_logior_test(reseed);
CM_store_test(reseed);

CM_f_lt_constant_1L(ty, (double)0, FLENS);
CM_logior_test(reseed);
CM_store_test(reseed);

CM_f_gt_constant_1L(tx, (double)image_edge_size,
                    FLENS);
CM_logior_test(reseed);
CM_store_test(reseed);

CM_f_gt_constant_1L(ty, (double)image_edge_size,
                    FLENS);
CM_logior_test(reseed);
CM_store_test(reseed);


/* set context and call seed_parts to reseed those that */
/* are off the edge */

CM_load_context(reseed);
seed_parts(particles);
CM_load_context(saved_context);
CM_deallocate_stack_through(saved_context);

} 

/**********************************************************/

main(argc, argv)
int argc;
char **argv;

{

CM_field_id_t particles, image;
CM_geometry_id_t image_geometry;
CM_vp_set_id_t image_vp_set, particle_vp_set;
unsigned dimensions[2], gen, color_length;


/**********************************************************/
/* allocate the particles field */

particle_vp_set = CM_current_vp_set;
particles = CM_ALLOCATE_TYPE_ON_HEAP(Particle);
CM_u_move_zero_1L(particles, CM_TYPELEN(Particle));

/**********************************************************/
/* allocate the image field */

color_length = CM_TYPELEN(char);
dimensions[0] = image_edge_size;
dimensions[1] = image_edge_size;

image_geometry = CM_create_geometry(dimensions, 2);
image_vp_set = CM_allocate_vp_set(image_geometry);

CM_set_vp_set(image_vp_set);
image = CM_allocate_heap_field(color_length);
CM_u_move_zero_1L(image, color_length);


/**********************************************************/
/* The main loop */

CM_set_vp_set(particle_vp_set);
seed_parts(particles);
draw_particles(image, particles);

for (gen = 0; gen < NUM_TIME_STEPS; gen++) {
    update_part(particles);
    draw_particles(image, particles);
}

}




Appendix F 

Lines of Sight 


This program computes a line of sight from an arbitrary 
point in space to each point of a slice of terrain to 
calculate which of the points are in view. We consider 
the 2-d case:

* - Eye point

X - Terrain point along n

He - Height of the eye

Ht(n) - Height of the terrain at point n.

De - Displacement of the eye relative to the first point.
This value is negative if the eye is to the left
of the first point.

In this case, we calculate the angle a(n) in the following
triangle for every point X(n) such that:

angle a(n) = arctan( (n - De) / (He - Ht(n)) )

[Diagram: a right triangle with the eye point (*) at the top
vertex, a vertical side of length He - Ht(n), a horizontal
side of length n - De, the angle a(n) at the eye point, and
the terrain point X(n) at the far end of the horizontal
side. The original figure also carries the labels b(n), c(n),
and d(n).]


We consider a terrain point X(n) to be out of view when 
the angle a(n) is less than the largest angle 
encountered thus far between a(0) and a(n). 

a(n) < MAX( a(0)..a(i) ) where 0 < i <= n 


*/ 
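
The visibility rule described above can be checked serially before reading the parallel listing. The sketch below is an illustration under stated assumptions, not the C/Paris code: the function name lines_of_sight and its parameters are hypothetical, and because arctan is monotonically increasing, the sketch compares the ratios (n - De) / (He - Ht(n)) directly instead of calling an arctangent routine. A point is marked in sight when its value equals the running maximum, which is the role the inclusive max-scan plays on the CM.

```c
#include <assert.h>

/* Sets in_sight[n] to 1 when terrain point n is visible from the
   eye and to 0 otherwise.
   height[n]  : terrain height Ht(n)
   eye_disp   : De (negative when the eye is left of point 0)
   eye_height : He, assumed to exceed every terrain height */
void lines_of_sight(const double *height, int n_points,
                    double eye_disp, double eye_height,
                    int *in_sight)
{
    double running_max = 0.0;
    int n;

    for (n = 0; n < n_points; n++) {
        double rise = eye_height - height[n];
        double ratio;

        assert(rise > 0.0);            /* avoids division by zero */
        ratio = ((double)n - eye_disp) / rise;

        /* inclusive running maximum, as in a scan with max */
        if (n == 0 || ratio > running_max)
            running_max = ratio;

        in_sight[n] = (ratio == running_max);
    }
}
```

With heights {0, 100, 50, 900, 100}, De = -1000, and He = 1500, points 0, 1, and 3 are in sight. Points 2 and 4 are hidden because an earlier point subtends a larger angle.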


Example 29. Lines of sight: line-of-sight.c 


#include <stdio.h> 

#include <cm/paris.h> 

#include "macros-and-constants.h"

#define EYE_DISPLACEMENT -1000.0 /* De */

#define EYE_HEIGHT 1500.0 /* He */

#define TERRAIN_HEIGHT_LIMIT 750.0 /* limit random height */


/**********************************************************/ 

typedef struct { 
float 
angles, 
heights, 
displacements, 
side_ratios, 
running_maxs; 
short unsigned int 
news_coords; 

} Terrain; 

/**********************************************************/ 


main () 

{ 

CM_field_id_t
terrain_vars, /* all the variables for
               * the sight calculation */
angle,
height,
displacement,
side_ratio,
news_coord,
running_max,
in_sight; /* a 1-bit variable to mark
           * which processor is in
           * sight and which isn't
           */


/**********************************************************/

/* Begin Code */

CM_init(); /* always the first thing! */

CM_set_context();

/**********************************************************/
/* allocate the terrain_vars field */

terrain_vars = CM_ALLOCATE_TYPE_ON_HEAP(Terrain);
CM_u_move_zero_1L(terrain_vars, CM_TYPELEN(Terrain));

in_sight = CM_allocate_heap_field(1);
CM_u_move_zero_1L(in_sight, 1);

/* initialize locations */

/* first we clear out the Terrain struct on the CM */
CM_u_move_zero_1L(terrain_vars, CM_TYPELEN(Terrain));

/* now for convenience sake we provide aliases for
 * the fields in terrain_vars. Spare us some typing */
angle =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,angles);
height =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,heights);
displacement =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,
                       displacements);
side_ratio =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,side_ratios);
running_max =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,running_maxs);
news_coord =
    CM_STRUCT_SUBFIELD(terrain_vars,Terrain,news_coords);


/* assign the coordinate of each processor */
CM_my_news_coordinate_1L(news_coord, 0,
                         CM_TYPELEN(short));


/* make the news coordinate usable in a float form.
 * this will serve as the displacement away from the
 * beginning of the terrain */

CM_f_u_float_2_2L(displacement, news_coord,
                  CM_TYPELEN(short), FLENS);

/* give us some random scale factor for height */
CM_f_random_1L(height, FLENS);

/* scale height with random value */

CM_f_multiply_constant_2_1L(height,
                            (double) TERRAIN_HEIGHT_LIMIT,
                            FLENS);


/**********************************************************/
/* calculate angle */

/*
                 / (terrain.displacement - EYE_DISPLACEMENT) \
  angle = arctan | ----------------------------------------- |
                 \       (EYE_HEIGHT - terrain.height)       /
*/

CM_f_subtract_constant_2_1L(displacement,
                            (double) EYE_DISPLACEMENT,
                            FLENS);

CM_f_subfrom_constant_2_1L(height,
                           (double) EYE_HEIGHT,
                           FLENS);

/* since we constrain height to be non-zero by our choice
of constants we don't need to worry about division by
zero */

CM_f_divide_3_1L(side_ratio, displacement, height,
                 FLENS);

CM_f_atan_2_1L(angle, side_ratio, FLENS);


/* scan with maximum */ 


CM_scan_with_f_max_1L(running_max, angle, 0, FLENS,
                      CM_upward,    /* direction */
                      CM_inclusive, /* include self */
                      CM_none,      /* do not segment scan */
                      CM_no_field   /* no segment bit */
                      );

/* determine which points are in view */


/* set in_sight for those processors whose angle is
equal to the running max */

CM_f_eq_1L(angle, running_max, FLENS);

CM_store_test(in_sight);


} 


Appendix G 

Drawing Lines 


/* 

For each active processor, draw_line draws a line into the 
image field between the locations specified in the start 
vector (sx, sy) and the end vector (ex, ey). This example is 
a simplified version of the line drawing function 
CMSR_f_draw_line found in the *Render parallel rendering
library. 

THE ALGORITHM 


The algorithm we use is a variant of the Bresenham
algorithm, redesigned slightly to work on the Connection
Machine. In short, the program simply takes the starting
point of the line and uses a Digital Differential Analyzer
(DDA) to compute the location of the pixels on the line
to the end pixel. A DDA is really nothing more than a process
that, given some initial conditions, allows us to perform
the same steps on all lines to interpolate the in-between
locations of the pixels. This is the type of situation that
we strive for when programming the CM because we have broken
the problem into very small pieces and use the scan
mechanism to calculate in parallel.


The DDA dictates that we have one axis act as the discrete
independent variable. Or in English, one variable must be
incremented by one and the other by a differential form.
Since we are drawing straight lines, the form is nothing
more than the slope of the line. We will use x as the
discrete independent variable and increment it by one and
increment y by the slope. This is only possible after we
apply the proper initial conditions.

The conditions we must have are:

1) dx, the difference in x, must be larger than dy. 

2) dx must be positive. 

Obviously this restricts us to a limited set of lines if we 
do not provide a method to make all lines conform to this 
specification. As it turns out, conforming requires little 
more than a few comparisons and a few one-bit flags to keep 
track of things. 


THE IMPLEMENTATION 


One important aspect of CM programming is keeping all the
processors busy. In many cases it pays to look at a problem 
in terms of decomposition. In this case we can decompose 
the lines we want to draw into constituent pixels. Once we 
have the lines decomposed into pixels we are able to operate 
on them in parallel performing the necessary interpolations. 
Once interpolated, each point is rendered independently. 


When we enter draw_line we enter it in what is assumed to be 
the vp-set that contains the start and end vertices. From 
these vertices, we compute the other information we need 
such as dx, dy, slope, and various flags as mentioned above. 


Once this information is calculated, we need to create a 
place for the pixels to be calculated. We accomplish this 
by creating a new vp-set with size np, where np is the total 
number of pixels in all the lines. This allows us to 
associate one virtual processor with each point. The pixel
vp-set will be divided as a contiguous one-dimensional array
of line segments. It will be zero-based so that line 0,
which has, say, 100 pixels, will take up the first 100 pixel
vp's (0-99), line 1, which has, say, 30 pixels, will be in
the next 30 vp's (100-129), and so on.

Next we transfer the data to the pixel vp-set from the line
vp-set through a send instruction. We only need one send
because the necessary data will be in one big field divided
up into separate little fields. When at all possible, this is
the preferred method. Reducing the number of sends enhances
program performance.

Before we can do a send, however, we must first calculate
the send address. Since we have constrained dx > dy (with
exceptions kept track of with flags) and use a DDA to
calculate the intermediate points, we know that the exact
number of pixels between the start and the end point is
dx + 1. In this program we will simply use dx**. If we scan
with add the values in dx we have a running sum of values
that also serve as the send address. To make this a
general-purpose algorithm for drawing lines, we allow a
hypothetical user to call this Paris function in an arbitrary
vp-set of any legal dimension. Before we do the scan we must
temporarily change the shape of the vp-set to make it 1-d.
Then we do the scan so that each send address we have
calculated corresponds to a real send address and so that all
send addresses, with segment length included, are contiguous.
Accordingly, the pixel vp-set must be in send ordering.


Now we interpolate. In order to do that we must copy the 
sent value of the starting point to the rest of the 
processors in the segment with scan with copy. We then scan 
with add the values of the slope and the unity (a temp field 
with all ones). Then we add the scanned slope values to
y and the scanned unity values to x. 


We now have a computed point in each processor and can thus 
call draw_point with these values to update the image field. 


We are done. 


** Using dx as opposed to dx + 1 differs only in that the 
end point will not be drawn (useful if one desires connected 
lines). To include the end point, simply add 1 to dx after 
the slope calculation. As an exercise the reader can change 
the code so that it will draw the end point optionally based 
on the true/false condition of either a front-end value 
specified as an argument to the function or specified in a 
one bit field passed in as an argument. In either case be 
careful to maintain the conditions for the DDA and not 
change the line that is drawn. The only pixel that should 
change is the end pixel and no other. Or you can use the 
Connection Machine graphics library *Render. 


*/ 
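
The two central ideas of the description above, the running sum that assigns each line a contiguous segment of pixel processors and the DDA interpolation under the two initial conditions, can be sketched in serial C. This is an illustrative sketch under stated assumptions, not the draw-line.c listing: the names Pixel, exclusive_scan, round_nearest, and dda_line are hypothetical, coordinates are plain integers, and rounding to the nearest integer is an assumed rule. As in the listing, dx pixels are produced per line, so the end point is not drawn.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } Pixel;

/* Exclusive running sum: offset[i] = count[0] + ... + count[i-1].
   Returns the grand total, i.e. the number of pixel processors
   needed for all lines. */
unsigned exclusive_scan(const unsigned *count, unsigned *offset, int n)
{
    unsigned sum = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        offset[i] = sum;        /* start of segment i */
        sum += count[i];
    }
    return sum;
}

/* Round to the nearest integer (an assumed rounding rule). */
static int round_nearest(double v)
{
    return (int)(v >= 0.0 ? v + 0.5 : v - 0.5);
}

/* Writes the pixels of one line into out[] and returns how many
   were written. out[] must hold max(|dx|, |dy|) entries. */
unsigned dda_line(int sx, int sy, int ex, int ey, Pixel *out)
{
    int dx = ex - sx, dy = ey - sy;
    int reversed = 0, xstep = 1, t;
    double slope;
    unsigned i, count;

    if (abs(dy) > abs(dx)) {    /* condition 1: |dx| >= |dy| */
        t = dx; dx = dy; dy = t;
        t = sx; sx = sy; sy = t;
        reversed = 1;           /* plays the role of reverse_p */
    }
    if (dx == 0)
        return 0;               /* degenerate line, nothing to draw */

    slope = (double)dy / (double)dx;
    if (dx < 0) {               /* condition 2: dx positive */
        dx = -dx;
        slope = -slope;
        xstep = -1;             /* plays the role of ends_sw */
    }

    count = (unsigned)dx;
    for (i = 0; i < count; i++) {
        int a = sx + (int)i * xstep;
        int b = round_nearest((double)sy + (double)i * slope);
        out[i].x = reversed ? b : a;   /* undo the X-for-Y swap */
        out[i].y = reversed ? a : b;
    }
    return count;
}
```

For per-line pixel counts of 100, 30, and 5, exclusive_scan yields segment starts 0, 100, and 130 and a total of 135, which matches the layout described for the pixel vp-set. In the parallel version the loop over i disappears: each pixel processor holds its own i, obtained from the scanned unity field.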
Example 30. Drawing lines: draw-line.c


#include <stdio.h> 

#include <cm/paris.h> 

#include "macros-and-constants.h"

/**********************************************************/ 


/* 

the following structure will be used to transfer all the 
necessary information between the line vp-set and the 
pixel vp-set 
*/ 

typedef struct {
    float
        tsx_s, tsy_s, slope_s;
    unsigned char
        tcolor_s, flags_s;
} Line_Fields_Packet;

typedef struct {
    unsigned long int
        send_address_s;
    float
        dx_s, dy_s, tex_s, tey_s, temp_x_s, temp_y_s;
} Temp_Fields;

unsigned
round_to_nearest_virtual_machine_size(num_of_vps)
unsigned num_of_vps;

{

unsigned exp;


exp = CM_physical_cube_address_length;
while (POWER_OF_TWO(exp) < num_of_vps) 
exp++; 


return POWER_OF_TWO(exp); 


} 
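
The rounding performed by the function above can be tried off the CM with a small stand-alone version. In this sketch the name round_up_to_power_of_two is hypothetical, and the starting exponent is passed as a parameter (min_exp) instead of being read from the machine configuration. The explicit shift replaces the POWER_OF_TWO macro, whose definition is in macros-and-constants.h and is assumed here to mean 2 raised to its argument.

```c
#include <assert.h>

/* Smallest power of two that is at least num_of_vps and at least
   2^min_exp. The exponent is capped at 31 so that the shift stays
   within a 32-bit unsigned value. */
unsigned round_up_to_power_of_two(unsigned num_of_vps, unsigned min_exp)
{
    unsigned exp = min_exp;

    assert(min_exp < 32);
    while (exp < 31 && (1u << exp) < num_of_vps)
        exp++;

    return 1u << exp;
}
```

For example, with min_exp = 13 (an 8192-processor machine), a request for 135 pixel processors is rounded up to 8192, and a request for 9000 is rounded up to 16384.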
/**********************************************************/

void
draw_line(image, sx, sy, ex, ey, color, color_length)
CM_field_id_t

image, /* The image field the lines render into. 
Assumed to have at least the number of 
bits as color_length */ 
sx, sy, /* The starting coords (integers) */ 
ex, ey, /* The ending coords (integers) */ 
color; /* The field which has a value for the 
color of the line */ 
unsigned color_length; 


{ 

/**********************************************************/ 
/* local variables and fields*/ 

/* local to the current vp-set when function called */ 


CM_vp_set_id_t
line_vp_set; /* the vp-set upon entering */


/* these two geometries are for temporarily changing the
line vp-set geometry to a one-dimensional one and back */
CM_geometry_id_t
saved_geometry, one_d_line_geometry;

CM_field_id_t
line_packet,
tsx, tsy, /* the info from the lines. Declared */
tex, tey, /* locally so we can bash them. */
tcolor,
slope, /* The slope of each line */
reverse_p, /* Set to TRUE if the coords have been
              reversed (X for Y) */

/* these fields will be part of flags_s */
saved_context, /* the context bit upon invocation */
ends_sw, /* set true if the start and end points
            of the lines are switched */
segment_start_bit, /* this field is used to mark the
                      beginning of a segment, in the
                      line vp-set it is set to one */

temp_packet, /* packets for the above structs */
             /* that will live on the CM */
dx, dy, /* the difference along each axis. */
send_address, /* a pointer to the starting processor
                 of the allocated segment for each
                 line */

temp_x, temp_y;


unsigned
pixel_sum, /* the total number of pixels */
           /* for all the lines */
line_procs; /* the total number of processors in
               the line vp-set */

/**********************************************************/ 
/* local variables and fields */ 

/* local to a vp-set allocated by this function */ 

CM_field_id_t /* these are all the fields allocated
                 in the allocated vp-set
                 they mirror the ones above */
alloc_line_packet,
alloc_x,
alloc_y,
alloc_color,
alloc_slope,
alloc_reverse_p,
alloc_saved_context,
alloc_ends_sw,
alloc_segment_start_bit,
alloc_temp;


/* these variables are needed to create a new vp-set */

/* describes the only axis for the new vp-set */
CM_axis_descriptor_t
allocated_descriptor_array[1];
struct CM_axis_descriptor allocated_descriptor;

CM_geometry_id_t /* the allocated vp-set's geometry */
allocated_geometry;

CM_vp_set_id_t
allocated_vp_set; /* the allocated vp-set */

/* This variable holds the number of bits that will be
copy scanned in the new vp-set */
unsigned copy_size;

/* END DECLARATION */

/**********************************************************/

/**********************************************************/ 
/* BEGIN CODE */ 

/* save the current vp-set */ 
line_vp_set = CM_current_vp_set; 


/**********************************************************/ 
/* Allocate space and initialize*/ 

line_packet = 

CM_allocate_stack_field(CM_TYPELEN(Line_Fields_Packet)); 


CM_u_move_zero_1L(line_packet,
                  CM_TYPELEN(Line_Fields_Packet));

tsx = CM_STRUCT_SUBFIELD(line_packet,
                         Line_Fields_Packet, tsx_s);
tsy = CM_STRUCT_SUBFIELD(line_packet,
                         Line_Fields_Packet, tsy_s);
slope = CM_STRUCT_SUBFIELD(line_packet,
                           Line_Fields_Packet,
                           slope_s);

tcolor = CM_STRUCT_SUBFIELD(line_packet,
                            Line_Fields_Packet,
                            tcolor_s);

reverse_p = CM_STRUCT_SUBFIELD(line_packet,
                               Line_Fields_Packet,
                               flags_s);

/* since reverse_p is actually 8 bits (=sizeof(char))
we can use the extra space for more 1-bit flags */
ends_sw =
    CM_add_offset_to_field_id(reverse_p, 1);

saved_context =
    CM_add_offset_to_field_id(reverse_p, 2);

segment_start_bit =
    CM_add_offset_to_field_id(reverse_p, 3);


/**********************************************************/
temp_packet =
    CM_allocate_stack_field(CM_TYPELEN(Temp_Fields));

CM_u_move_zero_1L(temp_packet,
                  CM_TYPELEN(Temp_Fields));

dx = CM_STRUCT_SUBFIELD(temp_packet,Temp_Fields,dx_s);
dy = CM_STRUCT_SUBFIELD(temp_packet,Temp_Fields,dy_s);
tex = CM_STRUCT_SUBFIELD(temp_packet,Temp_Fields,tex_s);
tey = CM_STRUCT_SUBFIELD(temp_packet,Temp_Fields,tey_s);
temp_x = CM_STRUCT_SUBFIELD(temp_packet,
                            Temp_Fields, temp_x_s);
temp_y = CM_STRUCT_SUBFIELD(temp_packet,
                            Temp_Fields, temp_y_s);

send_address = CM_STRUCT_SUBFIELD(temp_packet,
                                  Temp_Fields,
                                  send_address_s);

/* save context */ 

CM_store_context(saved_context);


/**********************************************************/
/* copy start and end into local values */

CM_f_move_1L(tsx, sx, FLENS);
CM_f_move_1L(tsy, sy, FLENS);
CM_f_move_1L(tex, ex, FLENS);
CM_f_move_1L(tey, ey, FLENS);
CM_u_move_1L(tcolor, color, color_length);

/* determine lengths */

CM_f_subtract_3_1L(dx, tex, tsx, FLENS);
CM_f_subtract_3_1L(dy, tey, tsy, FLENS);


/* if |dy| greater than |dx| reverse the coords
and set reverse_p */

CM_f_abs_2_1L(temp_x, dx, FLENS);
CM_f_abs_2_1L(temp_y, dy, FLENS);
CM_f_gt_1L(temp_y, temp_x, FLENS);
CM_logand_context_with_test();
CM_u_move_constant_1L(reverse_p, 1, 1);
CM_swap_2_1L(dx, dy, FLEN);
CM_swap_2_1L(tsx, tsy, FLEN);
CM_swap_2_1L(tex, tey, FLEN);
CM_load_context(saved_context);

/**********************************************************/ 
/* Compute the slope.*/ 

/* The slope will always be between [-1,1] because dx >= dy.
This will only work for lines where dx is non-zero. */

CM_f_divide_3_1L(slope, dy, dx, SLEN, ELEN);

/**********************************************************/

/* Make sure dx is positive. */

/* It will be used to figure out how many vp's to allocate.
Thus if it is less than zero we negate it.

If we negate dx we must also negate the slope and set a
flag so we know where we have turned things around. */

CM_f_lt_zero_1L(dx, FLENS);
CM_logand_context_with_test();
CM_f_negate_1_1L(dx, FLENS);
CM_f_negate_1_1L(slope, SLEN, ELEN);
CM_store_context(ends_sw);
CM_load_context(saved_context);

/**********************************************************/
/* Convert dx to integer value from float. */

CM_s_f_floor_2_2L(temp_x, dx, FLEN, FLENS);
CM_u_move_1L(dx, temp_x, FLEN);

/**********************************************************/ 
/* Initialize send_address to the processors to be
allocated


To make this function truly general we will allow it
to be called with any valid geometry. To make our scheme
of having a pointer in each line processor that
points to where its segment in the pixel vp-set
begins work, we must

1) save the current geometry of the line vp-set

2) figure out how big it is

3) create a new geometry of the same size BUT with
only one dimension (so the scan will work)

4) change the geometry of the line vp-set to the new
one-d geometry

5) perform a scan with unsigned add

6) restore the original geometry of the line vp-set

*/

saved_geometry = CM_vp_set_geometry(line_vp_set);
line_procs =
    CM_geometry_total_processors(saved_geometry);
one_d_line_geometry = CM_create_geometry(&line_procs, 1);

CM_set_vp_set_geometry(line_vp_set, one_d_line_geometry);
CM_scan_with_u_add_1L(send_address, dx, /* axis */ 0,
                      FLEN,
                      CM_upward, CM_exclusive,
                      CM_none, CM_no_field);
CM_set_vp_set_geometry(line_vp_set, saved_geometry);

/**********************************************************/
/* Allocate a vp-set for the pixels


Allocate a segment of processors for each line to be
drawn so that there is one virtual processor per pixel.
Thus the length of each segment equals the number of pixels
in the line and the total number of processors needed
is the total number of pixels.


Since we have constrained the slope to the interval [-1,1]
the number of processors we need to allocate for each
line is in dx. Thus a global sum of dx determines how
many total processors we need.

*/ 

/**********************************************************/ 

/* Compute the total length which is the size of the 
new vp-set. */ 


pixel_sum = CM_global_u_add_lL (dx, FLEN); 


/a*********************************************************/ 

/* Allocate the vp-set with a detailed geometry 


We need a detailed geometry because it is the only way 
to specify that we want send address ordering. 

We want send address order because we are going to 
send the line information from the line vp-set to 
this vp-set using an address computed by a running sum. 
If it were news ordering, which is 

the default, then we would have to first deposit the 
news coordinate into the send address. Instead we do 
less overhead on the CM and more on the front end 
which in this case makes more sense. 

The function round_to_nearest_virtual_machine__size is 
used because the vp mechanism requires a vp-set that is 
a power of two in size. The function is defined above. 


In this situation these are the slots in the 
CM__axis_descriptor structure that we need to fill 
with something non-zero. 
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allocated__descriptor. ordering = CM_jsend_order; 
allocated_descriptor.length = 

round_to_nearest__virtual_machine_size(pixel_jsum); 

/* these slots are cleared because they need to be */ 
allocated_descriptor.weight * 0; 
allocated_descriptor.on_chip_bits = 0; 
allocated_descriptor. on_chip_jpos = 0; 
allocated_descriptor.off_chip_bits = 0; 
allocated__descriptor. of f__chip__pos = 0; 
allocated_descriptor.vp__ratio = 0; 
allocated_descriptor.vp_ratio_multiplier = 0; 
allocated_descriptor. address__length = 0; 
allocated__descriptor. virtual_bitmask = 0; 


allocated_descriptor_array[0] =
    &allocated_descriptor;


allocated_geometry =
    CM_create_detailed_geometry
        (allocated_descriptor_array, 1);

allocated_vp_set =
    CM_allocate_vp_set (allocated_geometry);


/**********************************************************/
/* Initialize the allocated processors PART I */


CM_set_vp_set (allocated_vp_set);
CM_set_context ();


alloc_line_packet =
    CM_allocate_stack_field(CM_TYPELEN(Line_Fields_Packet));


CM_u_move_zero_1L (alloc_line_packet,
                   CM_TYPELEN(Line_Fields_Packet));

alloc_x = CM_STRUCT_SUBFIELD (alloc_line_packet,
                              Line_Fields_Packet,
                              tsx_s);

alloc_y = CM_STRUCT_SUBFIELD (alloc_line_packet,
                              Line_Fields_Packet,
                              tsy_s);

alloc_slope = CM_STRUCT_SUBFIELD (alloc_line_packet,
                                  Line_Fields_Packet,
                                  slope_s);

alloc_color = CM_STRUCT_SUBFIELD (alloc_line_packet,
                                  Line_Fields_Packet,
                                  tcolor_s);

alloc_reverse_p =
    CM_STRUCT_SUBFIELD (alloc_line_packet,
                        Line_Fields_Packet,
                        flags_s);

alloc_ends_sw =
    CM_add_offset_to_field_id(alloc_reverse_p, 1);

alloc_saved_context =
    CM_add_offset_to_field_id(alloc_reverse_p, 2);

alloc_segment_start_bit =
    CM_add_offset_to_field_id(alloc_reverse_p, 3);


/* Create one temporary variable for various and sundry 
purposes */ 


alloc_temp =
    CM_allocate_stack_field(FLEN);
CM_u_move_zero_1L (alloc_temp, FLEN);


/**********************************************************/ 
/* Set the context flag in just those processors that
   will really represent pixels. The rest are turned off.
   Save the result in alloc_saved_context.
*/

CM_my_send_address_1L(alloc_temp);

CM_u_lt_constant_1L (alloc_temp, pixel_sum, FLEN);
CM_logand_context_with_test ();

CM_store_context (alloc_saved_context);

/**********************************************************/
/* Initialize each segment-start processor.

   The values in the field send_address point to the first
   processor in each line segment of the allocated pixel
   vp-set. We therefore use it as a send address to move the
   computed line data to the pixel vp-set. Since the data
   consists of contiguous subfields of a larger field, we only
   have to perform one send. This is a major performance
   gain.
*/


CM_set_vp_set (line_vp_set);

/* set the segment start bit on in all processors that
   will send line values so we know where the segments
   start in the pixel vp-set. */

CM_u_move_constant_1L (segment_start_bit, 1, 1);

CM_send_1L (alloc_line_packet, send_address, line_packet,
            CM_TYPELEN(Line_Fields_Packet), CM_no_field);


/**********************************************************/
/* Spread the data from the starting processor in each
   segment to the rest of the processors in the segment.

   Notice the use of CM_start_bit to indicate that this
   will be a segmented scan.

   Notice also that since this is a copy we can optimize
   performance by performing only one scan instead of
   several, one per field.

   We want to copy all but the alloc_segment_start_bit
   and the alloc_saved_context bits because they have
   already been set for each processor. Since they are
   adjacent to each other in CM memory we only need one
   move instead of two.
*/

CM_set_vp_set (allocated_vp_set);


CM_u_move_1L(alloc_temp, alloc_saved_context, 2);


CM_scan_with_copy_1L (alloc_line_packet,
                      alloc_line_packet,
                      0, CM_TYPELEN(Line_Fields_Packet),
                      CM_upward, CM_inclusive,
                      CM_start_bit,
                      alloc_segment_start_bit);

CM_u_move_1L(alloc_saved_context, alloc_temp, 2);


/**********************************************************/ 
/* Compute the location of each pixel. */ 

/* Set the increment for the x coordinate to 1 or -1 at each 
pixel depending on ends_sw */ 


CM_f_move_constant_1L (alloc_temp, (double)1, FLENS);
CM_load_context (alloc_ends_sw);

CM_f_move_constant_1L (alloc_temp, (double)-1, FLENS);
CM_load_context (alloc_saved_context);


/**********************************************************/

/* Compute the increment for the x coordinate at each
   processor in the field alloc_temp. Since we have
   constrained dx >= dy, we can assume that it is the basis of
   the DDA and thus can be incremented by 1 or -1 depending
   on ends_sw */

CM_scan_with_f_add_1L (alloc_temp, alloc_temp, 0, FLENS,
                       CM_upward, CM_exclusive,
                       CM_start_bit,
                       alloc_segment_start_bit);

/**********************************************************/
/* Clean up spillover left by the start-bit segmented scan.
   Unfortunately, the segmented scan with add trashes the
   value in the first processor of the next segment. The
   good news is that we can use the segment bit field as a
   source to load into the context to clean up the field */

CM_load_context (alloc_segment_start_bit);

CM_f_move_zero_1L (alloc_temp, FLENS);

CM_load_context (alloc_saved_context);


/**********************************************************/ 
/* Add the increment to the x coordinate */ 

CM_f_add_2_1L (alloc_x, alloc_temp, FLENS);


/**********************************************************/ 
/* Compute the increment for the y coordinate at each
   processor in the field alloc_temp. Since we have
   constrained dx >= dy we can assume that it is the basis of
   the DDA and thus can be incremented by 1 or -1 depending
   on ends_sw */


/* Increment y by the slope */


CM_scan_with_f_add_1L (alloc_slope, alloc_slope, 0,
                       SLEN, ELEN,
                       CM_upward, CM_exclusive,
                       CM_start_bit,
                       alloc_segment_start_bit);


/* Clean up spillover left by the start-bit segmented scan.
   Unfortunately, the segmented scan with add trashes the
   value in the first processor of the next segment. The
   good news is that we can use the segment bit field as a
   source to load into the context to clean up the field */

CM_load_context (alloc_segment_start_bit);

CM_f_move_zero_1L (alloc_slope, SLEN, ELEN);

CM_load_context (alloc_saved_context);

/**********************************************************/
/* Add .5 to slope so we can floor it and
   get the integer value */


CM_f_add_constant_2_1L (alloc_slope, (double) 0.5,
                        SLEN, ELEN);


/* Add the increment to the y coordinate after converting 
it to an integer value */ 


CM_f_add_2_1L (alloc_y, alloc_slope, FLENS);

/**********************************************************/
/* Reverse X and Y if necessary */


CM_load_context (alloc_reverse_p);
CM_swap_2_1L (alloc_x, alloc_y, FLEN);
CM_load_context (alloc_saved_context);


/**********************************************************/ 
/* Draw the data after converting the coordinates to ints */ 

CM_s_f_floor_2_2L (alloc_temp, alloc_x,
                   CM_TYPELEN(short), SLEN, ELEN);
CM_s_move_1L (alloc_x, alloc_temp, CM_TYPELEN(short));


CM_s_f_floor_2_2L (alloc_temp, alloc_y,
                   CM_TYPELEN(short), SLEN, ELEN);
CM_s_move_1L (alloc_y, alloc_temp, CM_TYPELEN(short));

draw_point (image, alloc_x, alloc_y, alloc_color,
            CM_TYPELEN(short), color_length);


/**********************************************************/

/* Clean up */

CM_set_vp_set (line_vp_set);

CM_deallocate_stack_through (line_packet);
CM_deallocate_vp_set (allocated_vp_set);
CM_deallocate_geometry (allocated_geometry);
}


/* The following file is the main file that uses draw_line in a
real program and displays the results on the Connection Machine
graphics display system. Essentially, it creates two vp-sets,
one for the line vertices and one for the resultant image.

*/ 
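The pixel placement performed by draw_line above can also be traced serially. The following plain C sketch is not Paris code and is not part of the example program. It only mirrors the data flow described in the comments: sum dx over all lines, round the total up to a power of two, give each line a contiguous segment, let the position within the segment play the role of the exclusive segmented add-scan, then add 0.5 to y and floor. The names LineSeg, Pixel, round_up_pow2, floor_to_int, and place_pixels are hypothetical, and round_up_pow2 is only an assumption about what round_to_nearest_virtual_machine_size does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical serial stand-in for one line after draw_line's setup:
   start point, slope in [-1, 1], x step of +1 or -1, pixel count dx. */
typedef struct { double x0, y0, slope; int x_step; unsigned dx; } LineSeg;
typedef struct { int x, y; } Pixel;

/* Smallest power of two >= n. This is an assumed behaviour for the
   manual's round_to_nearest_virtual_machine_size, not its definition. */
static unsigned round_up_pow2(unsigned n)
{
    unsigned p = 1;
    assert(n <= UINT_MAX / 2 + 1);   /* otherwise p would overflow */
    while (p < n)
        p <<= 1;
    return p;
}

/* Floor of v as an int, written without libm. */
static int floor_to_int(double v)
{
    int t = (int)v;                  /* truncates toward zero */
    return ((double)t > v) ? t - 1 : t;
}

/* Fills out[0..total-1] and returns total, the sum of all dx.
   'start' is the exclusive running sum of dx, i.e. the address of the
   first pixel of each line's segment. */
static unsigned place_pixels(const LineSeg *lines, size_t nlines,
                             Pixel *out, size_t capacity)
{
    unsigned start = 0;
    size_t l;
    unsigned i;

    for (l = 0; l < nlines; l++) {
        assert((size_t)start + lines[l].dx <= capacity);
        for (i = 0; i < lines[l].dx; i++) {
            /* i is what an exclusive segmented add-scan of ones yields:
               0 at the segment start, then 1, 2, ... */
            double x = lines[l].x0 + (double)i * lines[l].x_step;
            double y = lines[l].y0 + (double)i * lines[l].slope;
            out[start + i].x = floor_to_int(x);
            out[start + i].y = floor_to_int(y + 0.5);  /* round to nearest */
        }
        start += lines[l].dx;
    }
    return start;
}
```

With two lines of 4 and 3 pixels, place_pixels returns 7, and round_up_pow2(7) gives the 8 processors a vp-set would need under the assumption above; the eighth processor corresponds to the ones the listing turns off with the pixel_sum comparison.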


Example 31. Displaying the results of the line-drawing procedure: draw-line-main.c 


#include <stdio.h> 

#include <cm/paris.h> 

#include <cm/cmfb.h> 

#include "macros-and-constants.h" 

#define IMAGE_DEPTH (unsigned)8 

/**********************************************************/ 
#define DISPLAY_IMAGE(display,buffer,image,arg1,arg2,clear) \
{ \
    CM_vp_set_id_t current, image_vp_set; \
    current = CM_current_vp_set; \
    image_vp_set = CM_field_vp_set(image); \
    CM_set_vp_set (image_vp_set); \
    CMFB_write_always (display, buffer, image, arg1, arg2); \
    if (clear) \
        CM_u_move_zero_1L (image, clear); \
    CM_set_vp_set (current); \
}

/**********************************************************/ 


main(argc,argv) 
char *argv[]; 

{ 

unsigned line_dimensions[1], image_dimensions[2];

CM_vp_set_id_t line_vp_set, image_vp_set;

CM_geometry_id_t line_geometry, image_geometry;

CM_field_id_t
    image_field,
    sx, sy, ex, ey, color,
    fsx, fsy, fex, fey,
    send_address;
unsigned
    coord_length, image_edge, sal, nl, zoom;

struct CMFB_display_id display;

/**********************************************************/ 
if (argc > 1) 

image_edge = atoi(argv[1]); 
else 

image_edge = 256; 

/**********************************************************/

printf ("Warm booting the CM ..."); fflush(stdout);

CM_init();
printf("Done\n");

/**********************************************************/ 

/* set up the framebuffer for use */

printf("Attaching and Initializing the Display ... Rainbow palette ...");
fflush(stdout);

zoom = 1024/image_edge - 1;

CMFB_attach_display(NULL, &display);
CMFB_initialize_display(&display, IMAGE_DEPTH, 1);
CMFB_set_zoom(&display, zoom, zoom, 0);
CMFB_set_color_table_rainbow (&display,
    (double) 1.0 /* red_freq */,
    (double) 1.0 /* green_freq */,
    (double) 1.0 /* blue_freq */,
    (double) 0.0 /* red_phase */,
    (double) 0.33333 /* green_phase */,
    (double) 0.66666 /* blue_phase */,
    (double) 1.0 /* red_ampl */,
    (double) 1.0 /* green_ampl */,
    (double) 1.0 /* blue_ampl */,
    (unsigned) 0 /* include_first_index */);
printf("Done\n");

/**********************************************************/ 
/* Initialize image regalia */ 

image_dimensions[0] = image_edge;
image_dimensions[1] = image_edge;

image_geometry = CM_create_geometry(image_dimensions, 2);
image_vp_set = CM_allocate_vp_set(image_geometry);
CM_set_vp_set (image_vp_set);


image_field = CM_allocate_stack_field(IMAGE_DEPTH);
CM_u_move_zero_1L (image_field, IMAGE_DEPTH);

/**********************************************************/

/* Initialize line regalia to be the size of the physical machine 
*/ 


line_dimensions[0] = CM_physical_cube_address_limit;
line_geometry = CM_create_geometry(line_dimensions, 1);
line_vp_set = CM_allocate_vp_set (line_geometry);
CM_set_vp_set (line_vp_set);


/* Determine the length in bits of a coordinate in our image 
vp-set. We need to add one for a sign bit */ 

coord_length = CM_geometry_coordinate_length(image_geometry,
                                             0) + 1;

/* Determine the size of the send address length in the line
   vp-set */

sal = CM_geometry_send_address_length(line_geometry);


sx = CM_allocate_stack_field(coord_length);
sy = CM_allocate_stack_field(coord_length);
ex = CM_allocate_stack_field(coord_length);
ey = CM_allocate_stack_field(coord_length);

fsx = CM_allocate_stack_field(FLEN);
fsy = CM_allocate_stack_field(FLEN);
fex = CM_allocate_stack_field(FLEN);
fey = CM_allocate_stack_field(FLEN);


color = CM_allocate_stack_field(IMAGE_DEPTH);


send_address = CM_allocate_stack_field(sal);


/* Set the vertices of each line to be radial. The starting
   point of each line is the center of the screen (actually the
   2-d image vp-set). The end point of each is a unique point
   along the perimeter of the screen. */

set_radial_lines(sx, sy, ex, ey, color, send_address,
                 coord_length, IMAGE_DEPTH, sal,
                 image_dimensions[0]);


CM_f_u_float_2_2L (fsx, sx, coord_length, FLENS);
CM_f_u_float_2_2L (fsy, sy, coord_length, FLENS);
CM_f_u_float_2_2L (fex, ex, coord_length, FLENS);
CM_f_u_float_2_2L (fey, ey, coord_length, FLENS);


/* Draw lines one at a time */


printf("Draw lines one at a time\n");

for (nl=0; nl < image_dimensions[0]<<2; nl+=image_dimensions[0]>>4) {

    CM_set_context();

    CM_u_eq_constant_1L (send_address, nl, sal);
    CM_logand_context_with_test ();

    draw_line (image_field, fsx, fsy, fex, fey, color,
               IMAGE_DEPTH);

    DISPLAY_IMAGE (&display, CMFB_current_buffer(&display),
                   image_field, 0, 0, IMAGE_DEPTH);


}

/**********************************************************/
/* Draw lines in parallel */

printf ("Draw lines in parallel\n");

CM_set_context ();

CM_u_lt_constant_1L (send_address, 4*image_dimensions[0],
                     sal);

CM_logand_context_with_test ();

draw_line (image_field, fsx, fsy, fex, fey, color,
           IMAGE_DEPTH);

DISPLAY_IMAGE (&display, CMFB_current_buffer(&display),
               image_field, 0, 0, IMAGE_DEPTH);


/**********************************************************/

CM_deallocate_stack_through(sx);


} 
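The one-at-a-time loop in main selects a single line per iteration by comparing send_address with nl, stepping nl by image_edge >> 4 up to image_edge << 2. The following plain C sketch (not Paris code; serial_line_indices is a hypothetical helper introduced here) lists the addresses that loop would visit. It also makes one property of the loop visible: for an image edge below 16 the stride is zero, so the loop as written would never advance. The sketch returns 0 in that case rather than looping.

```c
#include <assert.h>
#include <stddef.h>

/* Collects the send addresses visited by
       for (nl = 0; nl < image_edge << 2; nl += image_edge >> 4)
   into out[] and returns how many there are. Returns 0 when the
   stride image_edge >> 4 is zero (image_edge < 16). */
static size_t serial_line_indices(unsigned image_edge,
                                  unsigned *out, size_t capacity)
{
    unsigned limit = image_edge << 2;    /* 4 * image_edge perimeter lines */
    unsigned stride = image_edge >> 4;   /* image_edge / 16 */
    unsigned nl;
    size_t n = 0;

    if (stride == 0)
        return 0;

    for (nl = 0; nl < limit; nl += stride) {
        assert(n < capacity);
        out[n++] = nl;
    }
    return n;
}
```

For the default image_edge of 256 this yields 64 addresses, spaced 16 apart, from 0 to 1008.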

/* This function sets the first 4*image_size processors to have
   start vertices in the center of the screen and end points on
   the perimeter of the rectangular area defined by
   image_size x image_size. The result will be radial lines if the
   vertices are passed to draw_line. color is also set up. tmp is
   passed in because it is useful outside this procedure. */


set_radial_lines(sx, sy, ex, ey, color, tmp,
                 coord_length, color_length, tmp_length, image_size)
CM_field_id_t sx, sy, ex, ey, color, tmp;
unsigned coord_length, color_length, tmp_length, image_size;
{


CM_set_context();

CM_u_move_zero_1L (sx, coord_length);
CM_u_move_zero_1L (sy, coord_length);
CM_u_move_zero_1L (ex, coord_length);
CM_u_move_zero_1L (ey, coord_length);
CM_u_move_zero_1L (tmp, tmp_length);






CM_my_news_coordinate_1L (tmp, 0, tmp_length);


/**********************************************************/
/* initialize color */

CM_u_add_3_1L (color, tmp, tmp, color_length);
CM_u_eq_constant_1L (color, 0, color_length);
CM_logand_context_with_test ();

CM_u_move_constant_1L (color, 1, color_length);


CM_set_context ();


/**********************************************************/ 

/* set starting points to middle of screen */
CM_u_move_constant_1L (sx, image_size>>1, coord_length);
CM_u_move_constant_1L (sy, image_size>>1, coord_length);

/* set end points of 135 thru 45 deg lines */
CM_u_lt_constant_1L (tmp, image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_move_1L (ex, tmp, coord_length);

CM_u_move_constant_1L (ey, 0, coord_length);

/**********************************************************/ 

/* set 44 to 315 end points */ 

CM_set_context();

CM_u_ge_constant_1L (tmp, image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_lt_constant_1L (tmp, 2*image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_move_constant_1L (ex, image_size - 1, coord_length);
CM_u_move_1L (ey, tmp, coord_length);

CM_u_subtract_constant_2_1L (ey, image_size, coord_length);

/**********************************************************/ 

/* set 314 to 225 end points */ 

CM_set_context();

CM_u_ge_constant_1L (tmp, 2*image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_lt_constant_1L (tmp, 3*image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_move_constant_1L (ey, image_size - 1, coord_length);
CM_u_move_1L (ex, tmp, coord_length);

CM_u_subtract_constant_2_1L (ex, 2*image_size, coord_length);
CM_u_subfrom_constant_2_1L (ex, image_size - 1, coord_length);

/* set 224 to 136 end points */

CM_set_context ();





CM_u_ge_constant_1L (tmp, 3*image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_lt_constant_1L (tmp, 4*image_size, tmp_length);
CM_logand_context_with_test ();

CM_u_move_constant_1L (ex, 0, coord_length);

CM_u_move_1L (ey, tmp, coord_length);

CM_u_subtract_constant_2_1L (ey, 3*image_size, coord_length);
CM_u_subfrom_constant_2_1L (ey, image_size - 1, coord_length);


CM_set_context(); 


} 
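The four context-selected cases in set_radial_lines amount to a walk around the perimeter of the image. As a hedged illustration, the following plain C function computes, for a single processor index t, the values that the listing appears to store in sx, sy, ex, ey, and color. It is not Paris code. RadialLine and radial_line are hypothetical names, and the sketch assumes that CM_u_subfrom_constant_2_1L stores (constant - dest) in dest and that the color addition wraps to color_length bits.

```c
#include <assert.h>

/* Hypothetical record of what one processor holds after set_radial_lines. */
typedef struct { unsigned sx, sy, ex, ey, color; } RadialLine;

/* t is the processor's coordinate (0 <= t < 4*size); size is the image
   edge; color_bits is the color field length in bits (1..31). */
static RadialLine radial_line(unsigned t, unsigned size, unsigned color_bits)
{
    RadialLine r;

    assert(size > 0 && t < 4 * size);
    assert(color_bits > 0 && color_bits < 32);

    /* Every line starts at the center of the image. */
    r.sx = size >> 1;
    r.sy = size >> 1;

    if (t < size) {                 /* first edge: y = 0, x runs upward */
        r.ex = t;
        r.ey = 0;
    } else if (t < 2 * size) {      /* second edge: x = size-1, y runs upward */
        r.ex = size - 1;
        r.ey = t - size;
    } else if (t < 3 * size) {      /* third edge: y = size-1, x runs downward */
        r.ex = (size - 1) - (t - 2 * size);
        r.ey = size - 1;
    } else {                        /* fourth edge: x = 0, y runs downward */
        r.ex = 0;
        r.ey = (size - 1) - (t - 3 * size);
    }

    /* color = t + t truncated to the field width; 0 is replaced by 1. */
    r.color = (t + t) & ((1u << color_bits) - 1u);
    if (r.color == 0)
        r.color = 1;

    return r;
}
```

For a 256 x 256 image with 8-bit color, t = 300 falls on the second edge and gives the end point (255, 44) with color 88.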